Message ID | 1375104595-16018-5-git-send-email-joelf@ti.com (mailing list archive) |
---|---|
State | New, archived |
On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
> In an effort to move to using scatter-gather lists of any size with
> EDMA as discussed at [1] instead of placing limitations on the driver,
> we work through the limitations of the EDMAC hardware to find missed
> events and issue them.
>
> The sequence of events that requires this is:
>
> For the scenario where MAX slots for an EDMA channel is 3:
>
> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null
>
> The above SG list will have to be DMA'd in 2 sets:
>
> (1) SG1 -> SG2 -> SG3 -> Null
> (2) SG4 -> SG5 -> SG6 -> Null
>
> After (1) is successfully transferred, the events from the MMC controller
> do not stop coming and are missed by the time we have set up the transfer
> for (2). So here, we catch the events missed as an error condition and
> issue them manually.

Are you sure there won't be any effect of these missed events on the
peripheral side? For example, won't McASP get into an underrun condition
when it encounters a Null PaRAM set? Even UART has to transmit at a
particular baud rate, so I guess it cannot wait the way MMC/SD can.

Also, won't this lead to under-utilization of the peripheral bandwidth?
Meaning, MMC/SD is ready with data but cannot transfer because the DMA
is waiting to be set up.

Did you consider a ping-pong scheme with, say, three PaRAM sets per
channel? That way you can keep a continuous transfer going on from the
peripheral over the complete SG list.

Thanks,
Sekhar
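The split described in the quoted patch text can be sketched as a small C model. It assumes a hypothetical `MAX_NR_SG` of 3 (the "MAX slots" of the example) and uses array indices in place of real scatterlist entries; it only computes which entries land in each set, with the last slot of every set linked to Null. It is an illustration, not code from the patch series.

```c
#include <stddef.h>

/* Hypothetical per-channel slot limit, matching the "MAX slots = 3" example. */
#define MAX_NR_SG 3

/* One hardware set: up to MAX_NR_SG consecutive SG entries whose last
 * PaRAM slot links to the Null set. */
struct sg_batch {
	size_t first;	/* index of the first SG entry in this set */
	size_t count;	/* number of SG entries in this set */
};

/* Split an SG list of nents entries into sets of at most MAX_NR_SG.
 * Returns the number of sets written to out (caller sizes out). */
static size_t split_sg_list(size_t nents, struct sg_batch *out)
{
	size_t nbatches = 0;

	for (size_t i = 0; i < nents; i += MAX_NR_SG) {
		size_t left = nents - i;

		out[nbatches].first = i;
		out[nbatches].count = left < MAX_NR_SG ? left : MAX_NR_SG;
		nbatches++;
	}
	return nbatches;
}
```

For the six-entry list above this yields two sets, SG1..SG3 and SG4..SG6 (indices 0..2 and 3..5); a seventh entry would go into a third set of one.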
Hi Sekhar,

On 07/30/2013 02:05 AM, Sekhar Nori wrote:
> On Monday 29 July 2013 06:59 PM, Joel Fernandes wrote:
[...]
> Are you sure there won't be any effect of these missed events on the
> peripheral side? For example, won't McASP get into an underrun condition
> when it encounters a Null PaRAM set?

But it will not encounter a Null PaRAM set, because McASP uses contiguous
buffers for transfer which are not scattered across physical memory.
This can be accomplished with an SG list of size 1. For such SG lists,
this patch series leaves it linked to the Dummy set and does not link to
the Null set. The Null set is only used for SG lists that are larger than
MAX_NR_SG, such as those created by MMC and Crypto.

> Even UART has to transmit at a particular baud rate, so I guess it
> cannot wait the way MMC/SD can.

Existing drivers have to wait anyway if they hit the MAX SG limit today. If
they don't want to wait, they would have allocated a contiguous block of
memory and DMA'd that in one stretch so they don't lose any events, and in
such cases we are not linking to Null.

> Also, won't this lead to under-utilization of the peripheral bandwidth?
> Meaning, MMC/SD is ready with data but cannot transfer because the DMA
> is waiting to be set up.

But it is waiting anyway, even today. Currently, based on max_segs, the
MMC driver/subsystem makes SG lists of size max_segs. Between the sessions
of creating these smaller SG lists, if for some reason the MMC controller
is sending events, they will be lost anyway.

What will happen now with this patch series is that we simply accept a
bigger list than this and handle all the max_segs logic within the EDMA
driver itself, without the outside world knowing. This is actually more
efficient, as for long transfers we are not going back and forth much
between the client and the EDMA driver.

> Did you consider a ping-pong scheme with, say, three PaRAM sets per
> channel? That way you can keep a continuous transfer going on from the
> peripheral over the complete SG list.

Do you mean a ping-pong scheme as used in the davinci-pcm driver today?
This can be used only for buffers that are contiguous in memory, not
those that are scattered across memory.

Thanks,
-Joel
On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote:
> Hi Sekhar,
>
> On 07/30/2013 02:05 AM, Sekhar Nori wrote:
[...]
> Existing drivers have to wait anyway if they hit the MAX SG limit today. If
> they don't want to wait, they would have allocated a contiguous block of
> memory and DMA'd that in one stretch so they don't lose any events, and in
> such cases we are not linking to Null.

As long as the DMA driver can advertise its MAX SG limit, peripherals can
always work around that by limiting the number of sync events they
generate, so that none of the events get missed. With this series, I am
worried that the EDMA driver is advertising that it can handle SG lists of
any length while not taking care of missing any events while doing so.
This will break the assumptions that driver writers make.

>> Also, won't this lead to under-utilization of the peripheral bandwidth?
>> Meaning, MMC/SD is ready with data but cannot transfer because the DMA
>> is waiting to be set up.
>
> But it is waiting anyway, even today. Currently, based on max_segs, the
> MMC driver/subsystem makes SG lists of size max_segs. Between the sessions
> of creating these smaller SG lists, if for some reason the MMC controller
> is sending events, they will be lost anyway.

But the MMC/SD driver knows how many events it should generate if it
knows the MAX SG limit. So there should not be any missed events in the
current code. And I am not claiming that your solution is making matters
worse, but it's not making them much better either.

> What will happen now with this patch series is that we simply accept a
> bigger list than this and handle all the max_segs logic within the EDMA
> driver itself, without the outside world knowing. This is actually more
> efficient, as for long transfers we are not going back and forth much
> between the client and the EDMA driver.

Agreed, I am not debating that we need to handle SG lists of any length.
The hardware is capable of handling them, and there is no reason the
kernel should not.

>> Did you consider a ping-pong scheme with, say, three PaRAM sets per
>> channel? That way you can keep a continuous transfer going on from the
>> peripheral over the complete SG list.
>
> Do you mean a ping-pong scheme as used in the davinci-pcm driver today?

No. AFAIR, that's a ping-pong between internal RAM and DDR for earlier
audio ports which did not come with a FIFO.

> This can be used only for buffers that are contiguous in memory, not
> those that are scattered across memory.

I was hinting at using the linking facility of EDMA to achieve this.
Each PaRAM set has full 32-bit source and destination pointers, so I see
no reason why the non-contiguous case cannot be handled.

Let's say you need to transfer SG[0..6] on channel C. Now, PaRAM sets are
typically 4 times the number of channels. In this case we use one DMA
PaRAM set and two Link PaRAM sets per channel. P0 is the DMA PaRAM set
and P1 and P2 are the Link sets.

Initial setup:

SG0 -> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
^      ^      ^
|      |      |
P0  -> P1  -> P2  -> NULL

P[0..2].TCINTEN = 1, so we get an interrupt after each SG element
completion. On each completion interrupt, hardware automatically copies
the linked PaRAM set into the DMA PaRAM set, so after SG0 is transferred
out, the state of the hardware is:

SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
^      ^
|      |
P0,1   P2 -> NULL
|      ^
|      |
--------

The SG1 transfer has already started by the time the TC interrupt is
handled. As you can see, P1 is now redundant and ready to be recycled. So
in the interrupt handler, software recycles P1. Thus:

SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
^      ^      ^
|      |      |
P0  -> P2  -> P1  -> NULL

Now, on the next interrupt, P2 gets copied and thus can be recycled.
Hardware state:

SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
^      ^
|      |
P0,2   P1 -> NULL
|      ^
|      |
--------

As part of TC completion interrupt handling:

SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
^      ^      ^
|      |      |
P0  -> P1  -> P2  -> NULL

This goes on until the SG list is exhausted. If you use more PaRAM sets,
the interrupt handler gets more time to recycle each PaRAM set. At no
point do we touch P0, as it is always under active transfer. Thus the
peripheral is always kept busy.

Do you see any reason why such a mechanism cannot be implemented?

Thanks,
Sekhar
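Sekhar's link-recycling scheme can be simulated in a few lines of C. This is a hypothetical model, not EDMA driver code: `NR_LINK` link sets (P1 and P2 in the example) form a chain behind the active set P0; on each completion the head of the chain is "copied" into P0, one interrupt is counted (TCINTEN set on every set), and the handler moves the now-redundant link set to the tail and reprograms it with the next unsent SG entry. It assumes the handler always finishes before the hardware reaches the end of the chain.

```c
#define NR_LINK 2	/* hypothetical: link sets P1 and P2 */
#define NO_SG   (-1)	/* stands in for the Null PaRAM set */

/*
 * Transfer nents SG entries through one active set (P0) plus NR_LINK
 * recycled link sets. Writes the SG indices in the order P0 transfers
 * them to out, stores the number of TC interrupts in *irqs and returns
 * the number of entries transferred.
 */
static int run_link_ring(int nents, int *out, unsigned *irqs)
{
	int link_sg[NR_LINK];	/* SG index programmed into each link set */
	int order[NR_LINK];	/* link sets in chain order behind P0 */
	int active = nents > 0 ? 0 : NO_SG;	/* P0 starts on SG0 */
	int next_sg = NR_LINK + 1;		/* first SG not yet programmed */
	int n = 0;

	*irqs = 0;
	for (int k = 0; k < NR_LINK; k++) {
		link_sg[k] = k + 1 < nents ? k + 1 : NO_SG;
		order[k] = k;
	}

	while (active != NO_SG) {
		int head = order[0];

		out[n++] = active;		/* P0 transfers its SG entry */
		active = link_sg[head];		/* HW copies linked set into P0 */
		(*irqs)++;			/* TCINTEN = 1 on every set */

		/* ISR: 'head' is now redundant; move it to the tail of the
		 * chain and reprogram it with the next unsent SG entry. */
		for (int k = 0; k < NR_LINK - 1; k++)
			order[k] = order[k + 1];
		order[NR_LINK - 1] = head;
		link_sg[head] = next_sg < nents ? next_sg++ : NO_SG;
	}
	return n;
}
```

For SG[0..6] this follows the sequence in the diagrams: P1 and P2 alternate as the recycled set, every entry is transferred once and in order, and seven completion interrupts are taken, one per entry.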
On 07/31/2013 04:18 AM, Sekhar Nori wrote:
> On Wednesday 31 July 2013 10:19 AM, Joel Fernandes wrote:
[...]
> As long as the DMA driver can advertise its MAX SG limit, peripherals can
> always work around that by limiting the number of sync events they
> generate, so that none of the events get missed. With this series, I am
> worried that the EDMA driver is advertising that it can handle SG lists of
> any length while not taking care of missing any events while doing so.
> This will break the assumptions that driver writers make.

This is already being done by some other DMA engine drivers ;). We can
advertise more than we can handle at a time; that's the basis of this
whole idea.

I understand what you're saying, but events are not something that has to
be serviced immediately; they can be queued etc., and the actual transfer
from the DMA controller can be delayed. As long as we don't miss the
event we are fine, which my series takes care of.

So far I have tested this series on the following modules in various
configurations and have seen no issues:
- Crypto AES
- MMC/SD
- SPI (128x160 display)

>>> Also, won't this lead to under-utilization of the peripheral bandwidth?
>>> Meaning, MMC/SD is ready with data but cannot transfer because the DMA
>>> is waiting to be set up.
>>
>> But it is waiting anyway, even today. [...]
>
> But the MMC/SD driver knows how many events it should generate if it
> knows the MAX SG limit. So there should not be any missed events in the
> current code. And I am not claiming that your solution is making matters
> worse, but it's not making them much better either.

This is not true for crypto: the events are not deasserted and crypto
continues to send events. This is what led to the "don't trigger in
Null" patch, where I'm setting the missed flag to avoid recursion.

>> This can be used only for buffers that are contiguous in memory, not
>> those that are scattered across memory.
>
> I was hinting at using the linking facility of EDMA to achieve this.
[...]
> Do you see any reason why such a mechanism cannot be implemented?

This is possible and looks like another way to do it, but there are 2
problems I can see with it.

1. It's inefficient because of too many interrupts:

Imagine a case where we have an SG list of size 30 and MAX_NR_SG is 10.
This method will always trigger 30 interrupts, whereas with my patch
series you'd get only 3 interrupts. If you increase MAX_NR_SG, you'd get
even fewer interrupts.

2. If the interrupt handler for some reason doesn't complete or get
serviced in time, we will end up DMA'ing incorrect data, as events
won't stop coming in even if the interrupt is not yet handled (in your
example, linked sets P1 or P2 would be old ones being repeated). Whereas
with my method, we are not doing any DMA once we finish the current
MAX_NR_SG set, even if events continue to come.

I feel my patch series is efficient, has fewer LOC because of code reuse,
and has passed all the tests I've performed on it.

Thanks,
-Joel
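The interrupt count in point 1 is plain arithmetic. A minimal sketch, counting only completion interrupts as the argument does (any error interrupts raised for missed events in the batched scheme are not counted): the per-entry link scheme takes one interrupt per SG entry, and the batched scheme one per set of up to `MAX_NR_SG` entries. The function names are illustrative only.

```c
/* Per-entry linking: TCINTEN on every set, one interrupt per SG entry. */
static unsigned irqs_link_per_entry(unsigned nents)
{
	return nents;
}

/* Batched sets: one completion interrupt per set of up to max_nr_sg
 * entries, i.e. ceil(nents / max_nr_sg). Assumes max_nr_sg > 0. */
static unsigned irqs_batched(unsigned nents, unsigned max_nr_sg)
{
	return (nents + max_nr_sg - 1) / max_nr_sg;
}
```

With 30 entries and `MAX_NR_SG` of 10 this gives 30 versus 3; raising `MAX_NR_SG` to 20 gives 2.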
On 07/31/2013 09:27 PM, Joel Fernandes wrote:
> On 07/31/2013 04:18 AM, Sekhar Nori wrote:
[...]
>> As long as the DMA driver can advertise its MAX SG limit, peripherals can
>> always work around that by limiting the number of sync events they
>> generate, so that none of the events get missed. With this series, I am
>> worried that the EDMA driver is advertising that it can handle SG lists of
>> any length while not taking care of missing any events while doing so.
>> This will break the assumptions that driver writers make.

Sorry, I just forgot to respond to "not taking care of missing any events
while doing so". Can you clarify this? The DMA engine driver is taking
care of missed events.

Also, missing an event doesn't result in any feedback to the peripheral.
The peripheral sends an event to the DMA controller, and the event is
missed. The peripheral doesn't know anything about what happened and is
waiting for the transfer from the DMA controller.

Thanks,
-Joel
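The missed-event handling described here can be put as a small C model. Everything in it is hypothetical and simplified: one sync event per SG entry, a peripheral that raises its next event only after the previous one was serviced (as described above), and a worst case where that event always arrives before the driver has programmed the next set. In this model the controller only latches the miss, the driver replays it when it programs the next set, and the peripheral sees nothing but a delay. The names are not EDMA register or driver names.

```c
#include <stdbool.h>

/* Hypothetical model of one EDMA channel during a batched transfer. */
struct chan_model {
	unsigned slots_left;	  /* entries left in the programmed set */
	bool missed;		  /* latched missed-event (error) condition */
	unsigned transferred;	  /* entries actually moved */
	unsigned missed_events;	  /* events that hit the Null set */
	unsigned manual_triggers; /* events replayed by the driver */
};

/* The peripheral raises one sync event. */
static void sync_event(struct chan_model *c)
{
	if (c->slots_left == 0) {
		/* Set has ended on Null: the controller latches a miss;
		 * nothing is fed back to the peripheral. */
		c->missed = true;
		c->missed_events++;
		return;
	}
	c->slots_left--;
	c->transferred++;
}

/* Driver programs the next set and replays a latched event, if any. */
static void program_next_set(struct chan_model *c, unsigned count)
{
	c->slots_left = count;
	if (c->missed) {
		c->missed = false;
		c->manual_triggers++;
		sync_event(c);
	}
}

/* Worst case: the peripheral's next event always arrives before the
 * driver has programmed the next set. Assumes max_nr_sg > 0. */
static struct chan_model run_batched(unsigned nents, unsigned max_nr_sg)
{
	struct chan_model c = { 0 };
	unsigned programmed = nents < max_nr_sg ? nents : max_nr_sg;

	program_next_set(&c, programmed);
	while (c.transferred < nents) {
		sync_event(&c);		/* peripheral raises its next event */
		if (c.missed) {		/* completion ISR runs only now */
			unsigned left = nents - programmed;
			unsigned cnt = left < max_nr_sg ? left : max_nr_sg;

			programmed += cnt;
			program_next_set(&c, cnt);
		}
	}
	return c;
}
```

For six entries in sets of three, all six entries are transferred with one missed event replayed at the set boundary; for 30 entries in sets of 10, two are replayed. The model says nothing about whether a given peripheral tolerates that delay.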
On 07/31/2013 09:27 PM, Joel Fernandes wrote:
> On 07/31/2013 04:18 AM, Sekhar Nori wrote:
[...]
>> Do you see any reason why such a mechanism cannot be implemented?
>
> This is possible and looks like another way to do it, but there are 2
> problems I can see with it.
>
> 1. It's inefficient because of too many interrupts:
>
> Imagine a case where we have an SG list of size 30 and MAX_NR_SG is 10.
> This method will always trigger 30 interrupts, whereas with my patch
> series you'd get only 3 interrupts. If you increase MAX_NR_SG, you'd get
> even fewer interrupts.
>
> 2. If the interrupt handler for some reason doesn't complete or get
> serviced in time, we will end up DMA'ing incorrect data, as events
> won't stop coming in even if the interrupt is not yet handled (in your
> example, linked sets P1 or P2 would be old ones being repeated). Whereas
> with my method, we are not doing any DMA once we finish the current
> MAX_NR_SG set, even if events continue to come.

Actually, on second thought, 1. can be tackled by having a list of PaRAM
sets instead of just one set for P1, and another list for P2, and
ping-ponging between the P1 and P2 lists, interrupting only in between to
set up one or the other. However, 2. is still a concern.

Still, what you're asking for is a rewrite of quite a bit of the driver,
which I feel is unnecessary at this point, as my patch series is an
alternate method that's been tested and is working.

The only point of concern I think you have with the series is how
peripherals will react if their events are not handled right away. I am
certain that the peripheral doesn't go into an error state, because it
doesn't know that its event was missed and it'd just be waiting. I
haven't dealt with EDMA queuing, but wouldn't this kind of wait be
happening even with such queues?

Another note: the waiting is happening even in today's state of the
driver, where we limit by MAX_NR_SG. Probably not the same kind of wait
as with this series (send event and wait), but the peripheral is just not
doing anything.

Let me know if you have a specific case of a peripheral where this could
be a problem and I'll be happy to test it. Thanks.

Regards,
-Joel
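The refinement of point 1 (two lists of link sets, ping-ponged, with one interrupt per list) and the timing concern in point 2 can be combined in one hypothetical C model. It assumes one entry transferred per time unit, banks of `bank_size` link sets with TCINTEN only on the last set of each bank, both banks filled before the transfer starts, and an interrupt handler that needs `isr_latency` time units to refill a bank. All names are illustrative.

```c
#include <stdbool.h>

struct pingpong_result {
	unsigned irqs;	/* completion interrupts, one per bank */
	bool stale;	/* HW re-entered a bank before the ISR refilled it */
};

/* Assumes bank_size > 0. */
static struct pingpong_result run_pingpong(unsigned nents, unsigned bank_size,
					   unsigned isr_latency)
{
	struct pingpong_result r = { 0, false };
	unsigned nbanks = (nents + bank_size - 1) / bank_size;

	/* One interrupt per bank: only its last set has TCINTEN. */
	r.irqs = nbanks;

	/*
	 * Banks 0 and 1 are filled up front. Bank i (i >= 2) reuses the sets
	 * of bank i - 2, refilled by the ISR when bank i - 2 completes. The
	 * hardware reaches bank i after draining bank i - 1, which is always
	 * full (bank_size time units), so the refill is late if it takes
	 * longer than that.
	 */
	if (nbanks > 2 && isr_latency > bank_size)
		r.stale = true;
	return r;
}
```

With 30 entries in banks of 10 the interrupt count stays at 3, the same as the batched count quoted above, but once the handler is slower than one bank's worth of transfers the hardware runs into sets that still hold old entries, which is the concern in point 2.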
On Thursday 01 August 2013 07:57 AM, Joel Fernandes wrote:
> On 07/31/2013 04:18 AM, Sekhar Nori wrote:
[...]
> So far I have tested this series on the following modules in various
> configurations and have seen no issues:
> - Crypto AES
> - MMC/SD
> - SPI (128x160 display)

Notice how in each of these cases the peripheral is in control of when
data is driven out? Please test with McASP in a configuration where the
codec drives the frame-sync/bit-clock, or with UART at a high baud rate.

[...]
> This is not true for crypto: the events are not deasserted and crypto
> continues to send events. This is what led to the "don't trigger in
> Null" patch, where I'm setting the missed flag to avoid recursion.

Sorry, I am not sure which patch you are talking about here. Can you
provide the full subject line to avoid confusion?

[...]
> 1. It's inefficient because of too many interrupts:
>
> Imagine a case where we have an SG list of size 30 and MAX_NR_SG is 10.
> This method will always trigger 30 interrupts, whereas with my patch
> series you'd get only 3 interrupts. If you increase MAX_NR_SG, you'd get
> even fewer interrupts.

Yes, but you are seeing only one side of the inefficiency. In your design,
DMA *always* stalls waiting for the CPU to intervene. The whole point of
DMA is to keep it going while the CPU does bookkeeping in the background.
This is simply not going to scale with fast peripherals.

Besides, missed events are error conditions as far as EDMA and the
peripheral are concerned. You are handling an error interrupt to support
a successful transaction. Think about why EDMA considers missed events an
error condition.

>
> 2. If the interrupt handler for some reason doesn't complete or get
> serviced in time, we will end up DMA'ing incorrect data, as events
> won't stop coming in even if the interrupt is not yet handled (in your
> example, linked sets P1 or P2 would be old ones being repeated).
Where as > with my method, we are not doing any DMA once we finish the current > MAX_NR_SG set even if events continue to come. Where is repetition and possibility of wrong data being transferred? We have a linear list of PaRAM sets - not a loop. You would link the end to PaRAM set chain to dummy PaRAM set which BTW will not cause missed events. The more number of PaRAM sets you add to the chain, the more time CPU gets to intervene before DMA eventually stalls. This is a tradeoff system designers can manage. Thanks, Sekhar
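The recycling scheme described above can be modelled in a few lines of C to check that the SG elements come out in order and that P1 and P2 are reused alternately. This is a minimal sketch, not EDMA driver code: `struct param`, `struct sim` and `run_recycling()` are hypothetical names, only the SG index and LINK field of each PaRAM set are modelled, and the completion handler is assumed to run immediately after each link update.

```c
#include <assert.h>

#define DUMMY     (-1)  /* stands in for the dummy PaRAM set */
#define MAX_SLOTS 16
#define MAX_SG    64

/* Simplified stand-in for a PaRAM set (assumption): only the SG element
 * it moves and the slot it links to. Real sets also hold OPT, SRC, DST,
 * counts, etc. */
struct param {
	int sg;
	int link;
};

struct sim {
	struct param slot[MAX_SLOTS]; /* slot[0] is P0, the DMA PaRAM set */
	int transferred[MAX_SG];      /* SG indices in completion order */
	int n_transferred;
	int n_interrupts;             /* TCINTEN is set on every set */
	int recycled[MAX_SG];         /* link slot reused by each TC handler run */
	int n_recycled;
};

/* Move sg_len elements using P0 plus n_link link sets (P1..Pn) that the
 * completion handler recycles by appending them behind the chain's tail. */
static void run_recycling(struct sim *s, int sg_len, int n_link)
{
	int i, next_sg, tail;

	/* With a single link set the freed slot is also the tail, so the
	 * handler would have to touch P0, which this scheme avoids. */
	assert(n_link >= 2 && n_link < MAX_SLOTS);
	assert(sg_len >= 1 && sg_len <= MAX_SG);
	s->n_transferred = s->n_interrupts = s->n_recycled = 0;

	/* Initial setup: P0 -> P1 -> ... -> Pn -> DUMMY, one SG per set. */
	for (i = 0; i <= n_link && i < sg_len; i++) {
		s->slot[i].sg = i;
		s->slot[i].link = (i < n_link && i + 1 < sg_len) ? i + 1 : DUMMY;
	}
	next_sg = i;   /* first SG element not yet loaded */
	tail = i - 1;  /* last set in the chain */

	for (;;) {
		int freed;

		/* Hardware: the element in P0 completes, TC interrupt raised. */
		s->transferred[s->n_transferred++] = s->slot[0].sg;
		s->n_interrupts++;
		if (s->slot[0].link == DUMMY)
			break; /* chain exhausted */

		/* Hardware: link update copies the linked set into P0. */
		freed = s->slot[0].link;
		s->slot[0] = s->slot[freed];

		/* Software (TC handler): reload the freed set with the next SG
		 * element and append it behind the current tail. */
		if (next_sg < sg_len) {
			s->slot[freed].sg = next_sg++;
			s->slot[freed].link = DUMMY;
			s->slot[tail].link = freed;
			tail = freed;
			s->recycled[s->n_recycled++] = freed;
		}
	}
}
```

For `sg_len = 7` and `n_link = 2` the model loads SG3..SG6 into P1, P2, P1, P2, matching the diagrams, and takes one TC interrupt per element (7 in total), which is the interrupt cost raised in the thread.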
On 08/01/2013 01:13 AM, Sekhar Nori wrote:
> On Thursday 01 August 2013 07:57 AM, Joel Fernandes wrote:
>> On 07/31/2013 04:18 AM, Sekhar Nori wrote:
[..]
>> This is already being done by some other DMA engine drivers ;). We can
>> advertise more than we can handle at a time; that's the basis of this
>> whole idea.
>>
>> I understand what you're saying, but events are not something that have to
>> be serviced immediately; they can be queued etc. and the actual
>> transfer from the DMA controller can be delayed. As long as we don't
>> miss the event we are fine, which my series takes care of.
>>
>> So far I have tested this series on the following modules in various
>> configurations and have seen no issues:
>> - Crypto AES
>> - MMC/SD
>> - SPI (128x160 display)
>
> Notice how in each of these cases the peripheral is in control of when
> data is driven out? Please test with McASP in a configuration where the
> codec drives the frame-sync/bit-clock, or with UART at a high baud rate.

McASP allocates a contiguous buffer. For this case there is always an SG
of size 1 and this patch series doesn't affect it at all; there is no
stalling. Further, the McASP audio driver is still awaiting conversion to
use DMA engine, so there's no way yet to test it.

[..]
>> This is not true for crypto; the events are not deasserted and crypto
>> continues to send events. This is what led to the "don't trigger in
>> Null" patch where I'm setting the missed flag to avoid recursion.
>
> Sorry, I am not sure which patch you are talking about here. Can you
> provide the full subject line to avoid confusion?

Sure, "dma: edma: Detect null slot errors and handle them correctly".

[..]
>> 1. It's inefficient because of too many interrupts:
>>
>> Imagine a case where we have an SG list of size 30 and MAX_NR_SG is
>> 10. This method will always trigger 30 interrupts, whereas with my
>> patch series, you'd get only 3 interrupts. If you increase MAX_NR_SG,
>> you'd get even fewer interrupts.
>
> Yes, but you are seeing only one side of the inefficiency. In your design
> DMA *always* stalls waiting for the CPU to intervene. The whole point of DMA
> is to keep it going while the CPU does bookkeeping in the background. This is
> simply not going to scale with fast peripherals.

Agreed. So far, though, I have no way to reproduce a fast peripheral that
scatters data across physical memory and suffers from any stall.

> Besides, missed events are error conditions as far as EDMA and the
> peripheral are concerned. You are handling an error interrupt to support a
> successful transaction. Think about why EDMA considers missed events an
> error condition.

I agree with this; it's not the best way to do it. I have been working on
a different approach.

However, in support of the series:
1. It doesn't break any existing code.
2. It works for all current DMA users (performance and correctness).
3. It removes the SG limitations on DMA users.

So what you suggested would be more of a feature addition than a
limitation of this series. It is at least better than what's being done
now (forcing a limit on the total number of SGs), so it is a step in
the right direction.

>> 2. If the interrupt handler for some reason doesn't complete or get
>> serviced in time, we will end up DMA'ing incorrect data as events
>> wouldn't stop coming in even if the interrupt is not yet handled (in your
>> example, linked sets P1 or P2 would be old ones being repeated). Whereas
>> with my method, we are not doing any DMA once we finish the current
>> MAX_NR_SG set even if events continue to come.
>
> Where is the repetition and possibility of wrong data being transferred? We
> have a linear list of PaRAM sets, not a loop. You would link the end of the
> PaRAM set chain to the dummy PaRAM set, which BTW will not cause missed
> events. The more PaRAM sets you add to the chain, the more

There would have to be a loop; how else would you ensure continuity and
uninterrupted DMA?

Consider if you have 2 sets of linked sets:
L1 is the first set of linked sets and L2 is the second.

When L1 is done, EDMA continues with L2 (due to the link) while the
interrupt handler prepares L1. The continuity depends on L1 being linked
to L2. Only the absolute last break-up of the MAX_NR_SG linked set will
be linked to Dummy.

So consider MAX_NR_SG=10, and sg_len = 35

L1 - L2 - L1 - L1 - Dummy

The split would be in number of slots,
10 - 10 - 10 - 5 - Dummy

> time CPU gets to intervene before DMA eventually stalls. This is a
> tradeoff system designers can manage.

Consider what happens in the case where MAX_NR_SG=1 or 2. In that case,
there's a chance we might not get enough time for the interrupt handler
to set up the next series of linked sets.

Somehow this limitation has to be overcome by advising in comments than
MAX_NR_SG should always be greater than a certain number to ensure
proper operation.

Thanks,
-Joel
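One reading of the missed-event handling referred to in this exchange (catch the missed event as an error and issue it manually, but while the channel sits on the Null set only record it, to avoid recursion) is sketched below. The state struct and both handler names are hypothetical and are not the edma driver's API; the actual patch "dma: edma: Detect null slot errors and handle them correctly" may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-channel state (assumption, not driver code). */
struct chan_state {
	bool on_null_slot;    /* current MAX_NR_SG chunk done, Null set active */
	bool missed;          /* an event was missed while on the Null set */
	int manual_triggers;  /* count of software-issued (manual) triggers */
};

/* Modelled CC error path: the controller reported a missed event. */
static void on_missed_event(struct chan_state *c)
{
	if (c->on_null_slot) {
		/* Triggering now would only hit the Null set and raise another
		 * error, so remember the event instead (avoids recursion). */
		c->missed = true;
		return;
	}
	c->manual_triggers++; /* otherwise issue the event manually at once */
}

/* Modelled completion path: software has linked in the next chunk. */
static void on_next_chunk_programmed(struct chan_state *c)
{
	c->on_null_slot = false;
	if (c->missed) {
		c->missed = false;
		c->manual_triggers++; /* replay the event that was missed */
	}
}
```

In this model a single flag stands for "at least one event was missed"; the deferred trigger is issued exactly once, after the next chunk is in place.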
Just some corrections here..

On 08/01/2013 03:28 PM, Joel Fernandes wrote:
[..]
> Consider if you have 2 sets of linked sets:
> L1 is the first set of linked sets and L2 is the second.
>
> When L1 is done, EDMA continues with L2 (due to the link) while the
> interrupt handler prepares L1. The continuity depends on L1 being linked
> to L2. Only the absolute last break-up of the MAX_NR_SG linked set will
> be linked to Dummy.
>
> So consider MAX_NR_SG=10, and sg_len = 35
>
> L1 - L2 - L1 - L1 - Dummy

Should be,
L1 - L2 - L1 - L2 - Dummy

>
> The split would be in number of slots,
> 10 - 10 - 10 - 5 - Dummy
>
>> time CPU gets to intervene before DMA eventually stalls. This is a
>> tradeoff system designers can manage.
>
> Consider what happens in the case where MAX_NR_SG=1 or 2. In that case,
> there's a chance we might not get enough time for the interrupt handler
> to set up the next series of linked sets.
>
> Somehow this limitation has to be overcome by advising in comments than
> MAX_NR_SG should always be greater than a certain number to ensure
> proper operation.

s/than/that/

Thanks,
-Joel
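The corrected split can be reproduced with a small helper. This is a hedged sketch: `plan_chunks()` is a hypothetical name, encoding L1/L2 as 0/1 is an assumption, and it only computes the plan; it does not program any PaRAM sets.

```c
#include <assert.h>

/* Hypothetical helper: split an SG list of sg_len entries into chunks of
 * at most max_nr_sg slots. chunk_len[k] is the size of chunk k and
 * link_set[k] says which of the two alternating link sets carries it
 * (0 = L1, 1 = L2). The last chunk is the one linked to Dummy. Returns the
 * number of chunks, i.e. one completion interrupt per chunk. The caller
 * must size both arrays for at least ceil(sg_len / max_nr_sg) entries. */
static int plan_chunks(int sg_len, int max_nr_sg, int *chunk_len, int *link_set)
{
	int k = 0;

	assert(sg_len >= 0 && max_nr_sg > 0);
	while (sg_len > 0) {
		int n = sg_len < max_nr_sg ? sg_len : max_nr_sg;

		chunk_len[k] = n;
		link_set[k] = k % 2; /* L1, L2, L1, L2, ... */
		sg_len -= n;
		k++;
	}
	return k;
}
```

For sg_len = 35 and MAX_NR_SG = 10 it yields 10, 10, 10, 5 on L1, L2, L1, L2, with the final chunk linked to Dummy; for sg_len = 30 it yields 3 chunks, which is the "only 3 interrupts" figure quoted earlier in the thread.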
On 8/2/2013 1:58 AM, Joel Fernandes wrote:
> On 08/01/2013 01:13 AM, Sekhar Nori wrote:
>> On Thursday 01 August 2013 07:57 AM, Joel Fernandes wrote:
[..]
>> Notice how in each of these cases the peripheral is in control of when
>> data is driven out? Please test with McASP in a configuration where the
>> codec drives the frame-sync/bit-clock, or with UART at a high baud rate.
>
> McASP allocates a contiguous buffer. For this case there is always an SG
> of size 1 and this patch series doesn't affect it at all; there is no
> stalling. Further, the McASP audio driver is still awaiting conversion to
> use DMA engine, so there's no way yet to test it.

Okay, looks like omap-serial does not use DMA either, so you cannot use
that. Anyway, my point is beyond what the McASP driver does currently. Once
you expose the "handle any number of SGs" feature from the EDMA driver, any
client is free to use it. So we need to think ahead to see if we break any
use cases.

[..]
>> Yes, but you are seeing only one side of the inefficiency. In your design
>> DMA *always* stalls waiting for the CPU to intervene. The whole point of DMA
>> is to keep it going while the CPU does bookkeeping in the background. This is
>> simply not going to scale with fast peripherals.
>
> Agreed. So far, though, I have no way to reproduce a fast peripheral that
> scatters data across physical memory and suffers from any stall.
>
>> Besides, missed events are error conditions as far as EDMA and the
>> peripheral are concerned. You are handling an error interrupt to support a
>> successful transaction. Think about why EDMA considers missed events an
>> error condition.
>
> I agree with this; it's not the best way to do it. I have been working on
> a different approach.
>
> However, in support of the series:
> 1. It doesn't break any existing code.
> 2. It works for all current DMA users (performance and correctness).
> 3. It removes the SG limitations on DMA users.

Right, all of this should be true even with the approach I am suggesting.

> So what you suggested would be more of a feature addition than a
> limitation of this series. It is at least better than what's being done
> now (forcing a limit on the total number of SGs), so it is a step in
> the right direction.

No, I do not see my approach as a feature addition to what you are
doing. They are two very contrasting approaches. For example, you would not
need the manual (re)trigger in the CC error condition in what I am proposing.

[..]
> There would have to be a loop; how else would you ensure continuity and
> uninterrupted DMA?

Uninterrupted DMA comes because of PaRAM set recycling. In my diagrams
above, hardware is *always* using P0 for transfer while software always
updates the tail of the PaRAM linked list.

>
> Consider if you have 2 sets of linked sets:
> L1 is the first set of linked sets and L2 is the second.

I think this is where there is confusion. I am using only one linked set
of PaRAM entries (P0->P1->P2->DUMMY). If you need more time to service
the interrupt before the DMA hits the dummy PaRAM, you allocate more link
PaRAM sets for the channel (P0->P1->...Pn->DUMMY). At no point was I
suggesting having two sets of linked PaRAM sets. Why would you need
something like that?
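The "linear list versus loop" question can be made concrete by walking the LINK fields of a single snapshot. A hedged sketch: the integer link table (index = slot, P0 = slot 0), `DUMMY = -1` and `walk_chain()` are illustrative assumptions, not EDMA register layouts.

```c
#include <assert.h>

#define DUMMY     (-1)  /* stands in for the dummy PaRAM set */
#define MAX_SLOTS 16

/* Follow LINK fields from P0 (slot 0). Fills order[] with the slots
 * visited and returns how many there were, or -1 if a slot is reached
 * twice, i.e. the links form a loop instead of ending at DUMMY. */
static int walk_chain(const int *link, int n_slots, int *order)
{
	int seen[MAX_SLOTS] = { 0 };
	int s = 0, len = 0;

	assert(n_slots > 0 && n_slots <= MAX_SLOTS);
	while (s != DUMMY) {
		if (s < 0 || s >= n_slots || seen[s])
			return -1;
		seen[s] = 1;
		order[len++] = s;
		s = link[s];
	}
	return len;
}
```

Both snapshots from the diagrams (P0 -> P1 -> P2 -> DUMMY, and after recycling P0 -> P2 -> P1 -> DUMMY) walk to DUMMY, while a table in which P2 links back to P1 is reported as a loop. Across successive snapshots the order of P1 and P2 alternates even though each individual snapshot is linear.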
Hi Sekhar, Thanks for your detailed illustrations. On 08/02/2013 08:26 AM, Sekhar Nori wrote: [..] >>>>>> This can be used only for buffers that are contiguous in memory, not >>>>>> those that are scattered across memory. >>>>> >>>>> I was hinting at using the linking facility of EDMA to achieve this. >>>>> Each PaRAM set has full 32-bit source and destination pointers so I see >>>>> no reason why non-contiguous case cannot be handled. >>>>> >>>>> Lets say you need to transfer SG[0..6] on channel C. Now, PaRAM sets are >>>>> typically 4 times the number of channels. In this case we use one DMA >>>>> PaRAM set and two Link PaRAM sets per channel. P0 is the DMA PaRAM set >>>>> and P1 and P2 are the Link sets. >>>>> >>>>> Initial setup: >>>>> >>>>> SG0 -> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL >>>>> ^ ^ ^ >>>>> | | | >>>>> P0 -> P1 -> P2 -> NULL >>>>> >>>>> P[0..2].TCINTEN = 1, so get an interrupt after each SG element >>>>> completion. On each completion interrupt, hardware automatically copies >>>>> the linked PaRAM set into the DMA PaRAM set so after SG0 is transferred >>>>> out, the state of hardware is: >>>>> >>>>> SG1 -> SG2 -> SG3 -> SG3 -> SG6 -> NULL >>>>> ^ ^ >>>>> | | >>>>> P0,1 P2 -> NULL >>>>> | ^ >>>>> | | >>>>> --------- >>>>> >>>>> SG1 transfer has already started by the time the TC interrupt is >>>>> handled. As you can see P1 is now redundant and ready to be recycled. So >>>>> in the interrupt handler, software recycles P1. Thus: >>>>> >>>>> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL >>>>> ^ ^ ^ >>>>> | | | >>>>> P0 -> P2 -> P1 -> NULL >>>>> >>>>> Now, on next interrupt, P2 gets copied and thus can get recycled. 
>>>>> Hardware state: >>>>> >>>>> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL >>>>> ^ ^ >>>>> | | >>>>> P0,2 P1 -> NULL >>>>> | ^ >>>>> | | >>>>> --------- >>>>> >>>>> As part of TC completion interrupt handling: >>>>> >>>>> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL >>>>> ^ ^ ^ >>>>> | | | >>>>> P0 -> P1 -> P2 -> NULL >>>>> >>>>> This goes on until the SG list in exhausted. If you use more PaRAM sets, >>>>> interrupt handler gets more time to recycle the PaRAM set. At no point >>>>> we touch P0 as it is always under active transfer. Thus the peripheral >>>>> is always kept busy. >>>>> >>>>> Do you see any reason why such a mechanism cannot be implemented? >>>> >>>> This is possible and looks like another way to do it, but there are 2 >>>> problems I can see with it. >>>> >>>> 1. Its inefficient because of too many interrupts: >>>> >>>> Imagine case where we have an SG list of size 30 and MAX_NR_SG size is >>>> 10. This method will trigger 30 interrupts always, where as with my >>>> patch series, you'd get only 3 interrupts. If you increase MAX_SG_NR , >>>> you'd get even fewer interrupts. >>> >>> Yes, but you are seeing only one side of inefficiency. In your design >>> DMA *always* stalls waiting for CPU to intervene. The whole point to DMA >>> is to keep it going while CPU does bookeeping in background. This is >>> simply not going to scale with fast peripherals. >> >> Agreed. So far though, I've no way to reproduce a fast peripheral that >> scatters data across physical memory and suffers from any stall. >> >>> Besides, missed events are error conditions as far as EDMA and the >>> peripheral is considered. You are handling error interrupt to support a >>> successful transaction. Think about why EDMA considers missed events as >>> error condition. >> >> I agree with this, its not the best way to do it. I have been working on >> a different approach. >> >> However, in support of the series: >> 1. It doesn't break any existing code >> 2. 
It works for all current DMA users (performance and correctness) >> 3. It removes the SG limitations on DMA users. > > Right, all of this should be true even with the approach I am suggesting. > >> So what you suggested, would be more of a feature addition than a >> limitation of this series. It is atleast better than what's being done >> now - forcing the limit to the total number of SGs, so it is a step in >> the right direction. > > No, I do not see my approach is an feature addition to what you are > doing. They are both very contrasting ways. For example, you would not > need the manual (re)trigger in CC error condition in what I am proposing. > >> >>>> 2. If the interrupt handler for some reason doesn't complete or get >>>> service in time, we will end up DMA'ing incorrect data as events >>>> wouldn't stop coming in even if interrupt is not yet handled (in your >>>> example linked sets P1 or P2 would be old ones being repeated). Where as >>>> with my method, we are not doing any DMA once we finish the current >>>> MAX_NR_SG set even if events continue to come. >>> >>> Where is repetition and possibility of wrong data being transferred? We >>> have a linear list of PaRAM sets - not a loop. You would link the end to >>> PaRAM set chain to dummy PaRAM set which BTW will not cause missed >>> events. The more number of PaRAM sets you add to the chain, the more >> >> There would have to be a loop, how else would you ensure continuity and >> uninterrupted DMA? > > Uninterrupted DMA comes because of PaRAM set recycling. In my diagrams > above, hardware is *always* using P0 for transfer while software always > updates the tail of PaRAM linked list. > >> >> Consider if you have 2 sets of linked sets: >> L1 is the first set of Linked sets and L2 is the second. > > I think this is where there is confusion. I am using only one linked set > of PaRAM entries (P0->P1->P2->DUMMY). 
> If you need more time to service the interrupt before the DMA hits the
> dummy PaRAM set, you allocate more link PaRAM sets for the channel
> (P0->P1->...Pn->DUMMY). At no point was I suggesting having two sets of
> linked PaRAM sets. Why would you need something like that?
>

I think we are talking about the same thing. For now, let's discuss having
just one linked set to avoid confusion; that's fine.

I think where our understanding differs is that the dummy link comes into
the picture only when we are transferring the *last* SG. For all the
others, there is a cyclic link between P1 and P2. Would you agree?

Even in your diagrams you are actually showing such a cyclic link:

>>>>>
>>>>> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>  ^      ^      ^
>>>>>  |      |      |
>>>>> P0  -> P2  -> P1  -> NULL

Compare this...

>>>>>
>>>>> Now, on the next interrupt, P2 gets copied and thus can get recycled.
>>>>> Hardware state:
>>>>>
>>>>> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>  ^      ^
>>>>>  |      |
>>>>> P0,2   P1 -> NULL
>>>>>  |      ^
>>>>>  |      |
>>>>>  --------
>>>>>
>>>>> As part of TC completion interrupt handling:
>>>>>
>>>>> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>  ^      ^      ^
>>>>>  |      |      |
>>>>> P0  -> P1  -> P2  -> NULL

...with this. Notice that P2 -> P1 became P1 -> P2.

The next logical diagram would look like:

>>>>>
>>>>> Now, on the next interrupt, P1 gets copied and thus can get recycled.
>>>>> Hardware state:
>>>>>
>>>>> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>  ^      ^
>>>>>  |      |
>>>>> P0,1   P2 -> NULL
>>>>>  |      ^
>>>>>  |      |
>>>>>  --------
>>>>>
>>>>> As part of TC completion interrupt handling:
>>>>>
>>>>> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>  ^      ^      ^
>>>>>  |      |      |
>>>>> P0  -> P2  -> P1  -> NULL

"P1 gets copied" happens only because of the cyclic link from P2 to P1;
it wouldn't have happened if P2 were linked to Dummy as you described.
Now, coming to 2 linked sets vs. 1, I meant the same thing: to give the
interrupt handler more time, we could have something like:

>>>>> As part of TC completion interrupt handling:
>>>>>
>>>>> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> NULL
>>>>>  ^      ^      ^
>>>>>  |      |      |
>>>>> P0  -> P1  -> P2  -> P3  -> P4  -> NULL

So what I was describing as 2 sets of linked sets is P1 and P2 forming one
set, and P3 and P4 forming another. We would then recycle a complete set
at once. That way the interrupt handler could do more work per interrupt
and get more time to recycle. So we would set up TC interrupts only for
P2 and P4 in the above diagram.

Thanks,
-Joel
Hi Sekhar,

Considering that you agree with my understanding of the approach you
proposed, I wrote some code to quickly try the alternative approach
(ping-pong between sets). Here is a hack patch:
https://github.com/joelagnel/linux-kernel/commits/dma/edma-no-sg-limits-interleaved

As I suspected, it also has problems with missed interrupts, which brings me
back to my other point about getting errors if the ISR doesn't get enough
time to set up the next transfer. If you use fewer than 5 MAX_NR slots, you
start seeing EDMA errors. With more than 5 slots, I don't see errors, but
there is stalling because of missed interrupts.

I observe that for an SG list of size 10, it takes at least 7 ms before the
interrupt handler (ISR) gets a chance to execute. I feel this is quite
long; what is your opinion?

To describe my approach: if MAX slots is 10, for example, we split it into
2 cyclically linked sets of 5 each. Interrupts are set up to trigger after
every 5 PaRAM set transfers. After the first 5 transfers, the ISR recycles
those sets for the next 5 entries in the SG list. This happens in parallel
with the transfer of the second set of 5.

Thanks,
-Joel

On 08/02/2013 01:15 PM, Joel Fernandes wrote:
[..]
> Even in your diagrams you are actually showing such a cyclic link:
>
>>>>>>
>>>>>> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>>  ^      ^      ^
>>>>>>  |      |      |
>>>>>> P0  -> P2  -> P1  -> NULL
>
> Compare this...
>
>>>>>>
>>>>>> Now, on the next interrupt, P2 gets copied and thus can get recycled.
>>>>>> Hardware state:
>>>>>>
>>>>>> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>>  ^      ^
>>>>>>  |      |
>>>>>> P0,2   P1 -> NULL
>>>>>>  |      ^
>>>>>>  |      |
>>>>>>  --------
>>>>>>
>>>>>> As part of TC completion interrupt handling:
>>>>>>
>>>>>> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>>  ^      ^      ^
>>>>>>  |      |      |
>>>>>> P0  -> P1  -> P2  -> NULL
>
> ...with this.
> Notice that P2 -> P1 became P1 -> P2.
>
> The next logical diagram would look like:
>
>>>>>>
>>>>>> Now, on the next interrupt, P1 gets copied and thus can get recycled.
>>>>>> Hardware state:
>>>>>>
>>>>>> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>>  ^      ^
>>>>>>  |      |
>>>>>> P0,1   P2 -> NULL
>>>>>>  |      ^
>>>>>>  |      |
>>>>>>  --------
>>>>>>
>>>>>> As part of TC completion interrupt handling:
>>>>>>
>>>>>> SG3 -> SG4 -> SG5 -> SG6 -> NULL
>>>>>>  ^      ^      ^
>>>>>>  |      |      |
>>>>>> P0  -> P2  -> P1  -> NULL
>
>
> "P1 gets copied" happens only because of the cyclic link from P2 to P1;
> it wouldn't have happened if P2 were linked to Dummy as you described.
>
> Now, coming to 2 linked sets vs. 1, I meant the same thing: to give the
> interrupt handler more time, we could have something like:
>
>>>>>> As part of TC completion interrupt handling:
>>>>>>
>>>>>> SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> NULL
>>>>>>  ^      ^      ^
>>>>>>  |      |      |
>>>>>> P0  -> P1  -> P2  -> P3  -> P4  -> NULL
>
> So what I was describing as 2 sets of linked sets is P1 and P2 forming one
> set, and P3 and P4 forming another. We would then recycle a complete set
> at once. That way the interrupt handler could do more work per interrupt
> and get more time to recycle. So we would set up TC interrupts only for
> P2 and P4 in the above diagram.
>
> Thanks,
>
> -Joel
>
diff --git a/drivers/dma/edma.c b/drivers/dma/edma.c
index d9a151b..aa4989f 100644
--- a/drivers/dma/edma.c
+++ b/drivers/dma/edma.c
@@ -417,7 +417,15 @@ static void edma_callback(unsigned ch_num, u16 ch_status, void *data)
 		break;
 
 	case DMA_CC_ERROR:
-		dev_dbg(dev, "transfer error on channel %d\n", ch_num);
+		if (echan->edesc) {
+			dev_dbg(dev, "Missed event on %d, retrying\n",
+				ch_num);
+			edma_clean_channel(echan->ch_num);
+			edma_stop(echan->ch_num);
+			edma_start(echan->ch_num);
+			edma_manual_trigger(echan->ch_num);
+		}
+		dev_dbg(dev, "handled error on channel %d\n", ch_num);
 		break;
 	default:
 		break;
In an effort to move to using scatter-gather lists of any size with EDMA,
as discussed at [1], instead of placing limitations on the driver, we work
through the limitations of the EDMAC hardware to find missed events and
issue them.

The sequence of events that requires this is:

For the scenario where MAX slots for an EDMA channel is 3:

SG1 -> SG2 -> SG3 -> SG4 -> SG5 -> SG6 -> Null

The above SG list will have to be DMA'd in 2 sets:

(1) SG1 -> SG2 -> SG3 -> Null
(2) SG4 -> SG5 -> SG6 -> Null

After (1) is successfully transferred, the events from the MMC controller
do not stop coming and are missed by the time we have set up the transfer
for (2). So here, we catch the missed events as an error condition and
issue them manually.

[1] http://marc.info/?l=linux-omap&m=137416733628831&w=2

Signed-off-by: Joel Fernandes <joelf@ti.com>
---
 drivers/dma/edma.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)