diff mbox

[Bug] VCHIQ functional test broken

Message ID 952745481.440383.1494666448727@email.1und1.de (mailing list archive)
State New, archived
Headers show

Commit Message

Stefan Wahren May 13, 2017, 9:07 a.m. UTC
> Stefan Wahren <stefan.wahren@i2se.com> hat am 24. April 2017 um 21:35 geschrieben:
> 
> 
> 
> > Russell King - ARM Linux <linux@armlinux.org.uk> hat am 24. April 2017 um 20:59 geschrieben:
> > 
> > 
> > On Mon, Apr 24, 2017 at 07:42:27PM +0200, Stefan Wahren wrote:
> > > > What I can't see is how changing flush_dcache_page() has possibly broken
> > > > this: it seems to imply that you're getting flush_dcache_page() somehow
> > > > called inbetween mapping it for DMA, using it for DMA and CPU accesses
> > > > occuring.  However, the driver doesn't call flush_dcache_page() other
> > > > than through get_user_pages() - and since dma_map_sg() is used in that
> > > > path, it should be fine.
> > > > 
> > > > So, I don't have much to offer.
> > > > 
> > > > However, flush_dcache_page() is probably still a tad heavy - it currently
> > > > flushes to the point of coherency, but it's really about making sure that
> > > > the user vs kernel mappings are coherent, not about device coherency.
> > > > That's the role of the DMA API.
> > > 
> > > Currently i don't care too much about performance. More important is to
> > > fix this regression, because i'm not able to verify the function of this
> > > driver efficiently.
> > 
> > This is a driver in staging, and the reason its in staging is because it
> > has problems.  This is just another example of another problem it has...
> > Yes, the regression is regrettable, but I don't think it's something to
> > get hung up on.
> > 
> > > I'm thinking about 2 options:
> > > 
> > > 1) add a force parameter to flush_dcache_page() which make it possible
> > >    to override the new logic
> > > 2) create a second version of flush_dcache_page() with the old behavior
> > 
> > Neither, really, because, as I've already explained, flush_dcache_page()
> > has nothing what so ever to do with DMA coherence - and if a driver
> > breaks because we change its semantic slightly (but still in a compliant
> > way) then it's uncovered a latent bug in _that_ driver.
> > 
> > Rather than trying to "fix" flush_dcache_page() to work how this driver
> > wants it to (which is a totally crazy thing to propose - we can't go
> > shoe-horning driver specific behaviour into these generic flushing
> > functions), it would be much better to work out why the driver has
> > broken.
> 
> I totally agree. I had the hope we could make a workaround within the driver until we found the real issue.
> 
> > 
> > I see the kernel driver has its own logging (using vchiq_log_info() etc?)
> > maybe a starting point would be to compare the output from a working
> > kernel with a non-working kernel to get more of an idea where the problem
> > actually is?
> 
> I will try ...
> 
> > 
> > What I'm willing to do is to temporarily drop the offending patch for
> > the next merge window to give more time for this driver to be debugged,
> > but I will want to re-apply it after that merge window closes.
> 
> Since this is already in 4.11, please keep it. At least this makes it easier to reproduce. No need for more confusion and effort.
> 

In the meantime this issue has been fixed by Phil [1].

Unfortunately i found another issue. If i enable CONFIG_HIGHMEM in the kernel config, the data during functional test gets corrupted. Phil said it's caused by the usage of get_user_pages() [2]. That makes me wonder, because there are a lot of gup users and it's hard to believe that they are also affected.

Would that be hard to fix?

In that case i tend to fix at least the dependency:


Thanks
Stefan

[1] - http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-May/105443.html
[2] - https://github.com/raspberrypi/linux/issues/1996

Comments

Russell King (Oracle) May 13, 2017, 9:30 a.m. UTC | #1
On Sat, May 13, 2017 at 11:07:28AM +0200, Stefan Wahren wrote:
> In the meantime this issue has been fixed by Phil [1].

Right - definitely a driver bug.  Mapping more memory for DMA than is
actually going to be DMA'd to and expecting data to be preserved is
really horrid.

> Unfortunately i found another issue. If i enable CONFIG_HIGHMEM in
> the kernel config, the data during functional test gets corrupted.
> Phil said it's caused by the usage of get_user_pages() [2].

Without knowing who "Phil" is in that thread, but...

   HIGHMEM is a problem because you can't use get_user_pages on pages in
   HIGHMEM.

is an interesting statement, and without any reasoning or evidence.

I also believe it to be incorrect.  get_user_pages() returns an array
of struct page pointers for the user memory, calling flush_dcache_page()
and flush_anon_page() on them to ensure that any kernel mapping is
coherent with what is in userspace.

As far as returning the array of page pointers, get_user_pages() doesn't
care whether they're lowmem or highmem.

flush_dcache_page() doesn't care either - if it wants to flush the page
and the page is a highmem page, it will temporarily map it before
flushing it.

flush_anon_page() is a no-op for all non-aliasing caches.

get_user_pages() works fine for whatever memory on other platforms and
drivers such as etnaviv, so I think this comes down to the vchip driver
doing things in ways that the kernel interfaces its using don't expect -
exactly like the "lets pass full pages to the DMA API" broken-ness.

I would like to hear the justification for that statement, but without
any justification, I assert that the statement is false.
Phil Elwell May 15, 2017, 2:29 p.m. UTC | #2
On 13/05/2017 10:30, Russell King - ARM Linux wrote:
> On Sat, May 13, 2017 at 11:07:28AM +0200, Stefan Wahren wrote:
>> In the meantime this issue has been fixed by Phil [1].
> 
> Right - definitely a driver bug.  Mapping more memory for DMA than is
> actually going to be DMA'd to and expecting data to be preserved is
> really horrid.

That feature was added during the upstreaming process, and as Stefan says
there is an outstanding patch for it.

>> Unfortunately i found another issue. If i enable CONFIG_HIGHMEM in
>> the kernel config, the data during functional test gets corrupted.
>> Phil said it's caused by the usage of get_user_pages() [2].
> 
> Without knowing who "Phil" is in that thread, but...
> 
>    HIGHMEM is a problem because you can't use get_user_pages on pages in
>    HIGHMEM.
> 
> is an interesting statement, and without any reasoning or evidence.
> 
> I also believe it to be incorrect.  get_user_pages() returns an array
> of struct page pointers for the user memory, calling flush_dcache_page()
> and flush_anon_page() on them to ensure that any kernel mapping is
> coherent with what is in userspace.
> 
> As far as returning the array of page pointers, get_user_pages() doesn't
> care whether they're lowmem or highmem.
> 
> flush_dcache_page() doesn't care either - if it wants to flush the page
> and the page is a highmem page, it will temporarily map it before
> flushing it.
> 
> flush_anon_page() is a no-op for all non-aliasing caches.
> 
> get_user_pages() works fine for whatever memory on other platforms and
> drivers such as etnaviv, so I think this comes down to the vchip driver
> doing things in ways that the kernel interfaces its using don't expect -
> exactly like the "lets pass full pages to the DMA API" broken-ness.

See previous comment.

> I would like to hear the justification for that statement, but without
> any justification, I assert that the statement is false.

I am the Phil in question, and the off-the-cuff comment was the result of
a hazy memory of issues encountered with VCHIQ bulk transfers as a Broadcom
employee (which would have been on a 2.6 kernel). I suspect there may have
been some use of kernel virtual addresses as an intermediate representation,
but I no longer have access to that code.

If get_user_pages is HIGHMEM-safe (and I can see why it would be), then the
cause of the corruption Stefan saw is probably the special handling of
unaligned reads, specifically:

			memcpy((char *)page_address(pages[0]) +
				pagelist->offset,
				fragments,
				head_bytes);

Phil
Stefan Wahren May 15, 2017, 2:54 p.m. UTC | #3
Am 15.05.2017 um 16:29 schrieb Phil Elwell:
> On 13/05/2017 10:30, Russell King - ARM Linux wrote:
>> On Sat, May 13, 2017 at 11:07:28AM +0200, Stefan Wahren wrote:
>>> In the meantime this issue has been fixed by Phil [1].
>> Right - definitely a driver bug.  Mapping more memory for DMA than is
>> actually going to be DMA'd to and expecting data to be preserved is
>> really horrid.
> That feature was added during the upstreaming process, and as Stefan says
> there is an outstanding patch for it.
>
>>> Unfortunately i found another issue. If i enable CONFIG_HIGHMEM in
>>> the kernel config, the data during functional test gets corrupted.
>>> Phil said it's caused by the usage of get_user_pages() [2].
>> Without knowing who "Phil" is in that thread, but...
>>
>>    HIGHMEM is a problem because you can't use get_user_pages on pages in
>>    HIGHMEM.
>>
>> is an interesting statement, and without any reasoning or evidence.
>>
>> I also believe it to be incorrect.  get_user_pages() returns an array
>> of struct page pointers for the user memory, calling flush_dcache_page()
>> and flush_anon_page() on them to ensure that any kernel mapping is
>> coherent with what is in userspace.
>>
>> As far as returning the array of page pointers, get_user_pages() doesn't
>> care whether they're lowmem or highmem.
>>
>> flush_dcache_page() doesn't care either - if it wants to flush the page
>> and the page is a highmem page, it will temporarily map it before
>> flushing it.
>>
>> flush_anon_page() is a no-op for all non-aliasing caches.
>>
>> get_user_pages() works fine for whatever memory on other platforms and
>> drivers such as etnaviv, so I think this comes down to the vchip driver
>> doing things in ways that the kernel interfaces its using don't expect -
>> exactly like the "lets pass full pages to the DMA API" broken-ness.
> See previous comment.
>
>> I would like to hear the justification for that statement, but without
>> any justification, I assert that the statement is false.
> I am the Phil in question, and the off-the-cuff comment was the result of
> a hazy memory of issues encountered with VCHIQ bulk transfers as a Broadcom
> employee (which would have been on a 2.6 kernel). I suspect there may have
> been some use of kernel virtual addresses as an intermediate representation,
> but I no longer have access to that code.
>
> If get_user_pages is HIGHMEM-safe (and I can see why it would be), then the
> cause of the corruption Stefan saw is probably the special handling of
> unaligned reads, specifically:
>
> 			memcpy((char *)page_address(pages[0]) +
> 				pagelist->offset,
> 				fragments,
> 				head_bytes);


Btw shouldn't we use copy_from_user() at this place?
Phil Elwell May 15, 2017, 3:05 p.m. UTC | #4
On 15/05/2017 15:54, Stefan Wahren wrote:
> Am 15.05.2017 um 16:29 schrieb Phil Elwell:
>> On 13/05/2017 10:30, Russell King - ARM Linux wrote:
>>> On Sat, May 13, 2017 at 11:07:28AM +0200, Stefan Wahren wrote:
>>>> In the meantime this issue has been fixed by Phil [1].
>>> Right - definitely a driver bug.  Mapping more memory for DMA than is
>>> actually going to be DMA'd to and expecting data to be preserved is
>>> really horrid.
>> That feature was added during the upstreaming process, and as Stefan says
>> there is an outstanding patch for it.
>>
>>>> Unfortunately i found another issue. If i enable CONFIG_HIGHMEM in
>>>> the kernel config, the data during functional test gets corrupted.
>>>> Phil said it's caused by the usage of get_user_pages() [2].
>>> Without knowing who "Phil" is in that thread, but...
>>>
>>>    HIGHMEM is a problem because you can't use get_user_pages on pages in
>>>    HIGHMEM.
>>>
>>> is an interesting statement, and without any reasoning or evidence.
>>>
>>> I also believe it to be incorrect.  get_user_pages() returns an array
>>> of struct page pointers for the user memory, calling flush_dcache_page()
>>> and flush_anon_page() on them to ensure that any kernel mapping is
>>> coherent with what is in userspace.
>>>
>>> As far as returning the array of page pointers, get_user_pages() doesn't
>>> care whether they're lowmem or highmem.
>>>
>>> flush_dcache_page() doesn't care either - if it wants to flush the page
>>> and the page is a highmem page, it will temporarily map it before
>>> flushing it.
>>>
>>> flush_anon_page() is a no-op for all non-aliasing caches.
>>>
>>> get_user_pages() works fine for whatever memory on other platforms and
>>> drivers such as etnaviv, so I think this comes down to the vchip driver
>>> doing things in ways that the kernel interfaces its using don't expect -
>>> exactly like the "lets pass full pages to the DMA API" broken-ness.
>> See previous comment.
>>
>>> I would like to hear the justification for that statement, but without
>>> any justification, I assert that the statement is false.
>> I am the Phil in question, and the off-the-cuff comment was the result of
>> a hazy memory of issues encountered with VCHIQ bulk transfers as a Broadcom
>> employee (which would have been on a 2.6 kernel). I suspect there may have
>> been some use of kernel virtual addresses as an intermediate representation,
>> but I no longer have access to that code.
>>
>> If get_user_pages is HIGHMEM-safe (and I can see why it would be), then the
>> cause of the corruption Stefan saw is probably the special handling of
>> unaligned reads, specifically:
>>
>> 			memcpy((char *)page_address(pages[0]) +
>> 				pagelist->offset,
>> 				fragments,
>> 				head_bytes);
> 
> 
> Btw shouldn't we use copy_from_user() at this place?

I'm sure you mean copy_to_user(), and the answer is "it's complicated". Depending
on the relative timing, this code can also be called from a kernel thread, so I
doubt that would work.

Phil
diff mbox

Patch

diff --git a/drivers/staging/vc04_services/Kconfig b/drivers/staging/vc04_servic
index 9e27636..6572903 100644
--- a/drivers/staging/vc04_services/Kconfig
+++ b/drivers/staging/vc04_services/Kconfig
@@ -2,7 +2,7 @@  menuconfig BCM_VIDEOCORE
        tristate "Broadcom VideoCore support"
        depends on HAS_DMA
        depends on OF
-       depends on RASPBERRYPI_FIRMWARE || (COMPILE_TEST && !RASPBERRYPI_FIRMWAR
+       depends on (RASPBERRYPI_FIRMWARE && !HIGHMEM) || (COMPILE_TEST && !RASPB
        default y
        help
                Support for Broadcom VideoCore services including