Message ID | 20190228071943.13072-1-l.stach@pengutronix.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | media: vb2-dc: skip CPU sync in map/unmap dma_buf | expand |
Pawel and/or Marek, can you take a look at this? It looks sane to me, but I'd like to have a second opinion as well before merging this. On 2/28/19 8:19 AM, Lucas Stach wrote: > This is rougly equivalent to ca0e68e21aae (drm/prime: skip CPU sync > in map/unmap dma_buf). The contig memory allocated is already device > coherent memory, so there is no point in doing a CPU sync when > mapping it to another device. Also most importers currently cache > the mapping so the CPU sync would only happen on the first import. > With that in mind we are better off with not pretending to do a > cache synchronization at all. > > This gets rid of a lot of CPU overhead in uses where those dma-bufs > are regularily imported and detached again, like Weston is currently > doing in the DRM compositor. Lucas, one question: shouldn't the same be done for dma-sg and vmalloc? Regards, Hans > > Signed-off-by: Lucas Stach <l.stach@pengutronix.de> > --- > drivers/media/common/videobuf2/videobuf2-dma-contig.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > index aff0ab7bf83d..d38f097c14ae 100644 > --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c > +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > @@ -273,8 +273,8 @@ static void vb2_dc_dmabuf_ops_detach(struct dma_buf *dbuf, > > /* release the scatterlist cache */ > if (attach->dma_dir != DMA_NONE) > - dma_unmap_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > - attach->dma_dir); > + dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > + attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > sg_free_table(sgt); > kfree(attach); > db_attach->priv = NULL; > @@ -305,8 +305,8 @@ static struct sg_table *vb2_dc_dmabuf_ops_map( > } > > /* mapping to the client with new direction */ > - sgt->nents = dma_map_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > - dma_dir); > + sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > + dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > if (!sgt->nents) { > pr_err("failed to map scatterlist\n"); > mutex_unlock(lock); >
Hi Hans, Am Dienstag, den 12.03.2019, 08:57 +0100 schrieb Hans Verkuil: > Pawel and/or Marek, can you take a look at this? > > It looks sane to me, but I'd like to have a second opinion as well before > merging this. > > On 2/28/19 8:19 AM, Lucas Stach wrote: > > This is rougly equivalent to ca0e68e21aae (drm/prime: skip CPU sync > > in map/unmap dma_buf). The contig memory allocated is already device > > coherent memory, so there is no point in doing a CPU sync when > > mapping it to another device. Also most importers currently cache > > the mapping so the CPU sync would only happen on the first import. > > With that in mind we are better off with not pretending to do a > > cache synchronization at all. > > > > This gets rid of a lot of CPU overhead in uses where those dma-bufs > > are regularily imported and detached again, like Weston is currently > > doing in the DRM compositor. > > Lucas, one question: shouldn't the same be done for dma-sg and vmalloc? I think we should do it eventually, as the importer side caching of the mappings will undermine any cache maintenance we do in the dmabuf map/unmap calls. But then the argument is less clear cut for dma-sg and vmalloc as those buffers will most likely have valid allocations in the cache. dc is a very clear special case, as the device coherent allocations never have valid cache allocations, so dropping the synchronization is very low risk to break any existing use-case. Both dma-sg and vmalloc require a bit more thought to make the cache maintenance both effective and low overhead, so this patch is only a start by picking the low-hanging fruit. Regards, Lucas > > > > > > Signed-off-by: Lucas Stach <l.stach@pengutronix.de> > > --- > > drivers/media/common/videobuf2/videobuf2-dma-contig.c | 8 ++++---- > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > index aff0ab7bf83d..d38f097c14ae 100644 > > --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > @@ -273,8 +273,8 @@ static void vb2_dc_dmabuf_ops_detach(struct dma_buf *dbuf, > > > > /* release the scatterlist cache */ > > if (attach->dma_dir != DMA_NONE) > > - dma_unmap_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > > - attach->dma_dir); > > + dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > > + attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > > sg_free_table(sgt); > > kfree(attach); > > db_attach->priv = NULL; > > @@ -305,8 +305,8 @@ static struct sg_table *vb2_dc_dmabuf_ops_map( > > } > > > > /* mapping to the client with new direction */ > > - sgt->nents = dma_map_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > > - dma_dir); > > + sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > > + dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > > if (!sgt->nents) { > > pr_err("failed to map scatterlist\n"); > > mutex_unlock(lock); > > > >
Hi Hans, Am Dienstag, den 12.03.2019, 11:03 +0100 schrieb Lucas Stach: > Hi Hans, > > Am Dienstag, den 12.03.2019, 08:57 +0100 schrieb Hans Verkuil: > > Pawel and/or Marek, can you take a look at this? Seems nobody else is interested in this patch. Would you be willing to take it without further review? Regards, Lucas > > It looks sane to me, but I'd like to have a second opinion as well before > > merging this. > > > > On 2/28/19 8:19 AM, Lucas Stach wrote: > > > This is rougly equivalent to ca0e68e21aae (drm/prime: skip CPU sync > > > in map/unmap dma_buf). The contig memory allocated is already device > > > coherent memory, so there is no point in doing a CPU sync when > > > mapping it to another device. Also most importers currently cache > > > the mapping so the CPU sync would only happen on the first import. > > > With that in mind we are better off with not pretending to do a > > > cache synchronization at all. > > > > > > This gets rid of a lot of CPU overhead in uses where those dma-bufs > > > are regularily imported and detached again, like Weston is currently > > > doing in the DRM compositor. > > > > Lucas, one question: shouldn't the same be done for dma-sg and vmalloc? > > I think we should do it eventually, as the importer side caching of the > mappings will undermine any cache maintenance we do in the dmabuf > map/unmap calls. But then the argument is less clear cut for dma-sg and > vmalloc as those buffers will most likely have valid allocations in the > cache. dc is a very clear special case, as the device coherent > allocations never have valid cache allocations, so dropping the > synchronization is very low risk to break any existing use-case. > > Both dma-sg and vmalloc require a bit more thought to make the cache > maintenance both effective and low overhead, so this patch is only a > start by picking the low-hanging fruit. > > Regards, > Lucas > > > > > > > > > Signed-off-by: Lucas Stach <l.stach@pengutronix.de> > > > > > > --- > > > drivers/media/common/videobuf2/videobuf2-dma-contig.c | 8 ++++---- > > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > > > diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > > index aff0ab7bf83d..d38f097c14ae 100644 > > > --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > > +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > > @@ -273,8 +273,8 @@ static void vb2_dc_dmabuf_ops_detach(struct dma_buf *dbuf, > > > > > > > > > /* release the scatterlist cache */ > > > > > > if (attach->dma_dir != DMA_NONE) > > > > > > - dma_unmap_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > > > > > > - attach->dma_dir); > > > > > > + dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > > > > > > + attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > > > > > > sg_free_table(sgt); > > > > > > kfree(attach); > > > > > > db_attach->priv = NULL; > > > @@ -305,8 +305,8 @@ static struct sg_table *vb2_dc_dmabuf_ops_map( > > > > > > } > > > > > > > > > /* mapping to the client with new direction */ > > > > > > - sgt->nents = dma_map_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > > > > > > - dma_dir); > > > > > > + sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > > > > > > + dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > > > > > > if (!sgt->nents) { > > > > > > pr_err("failed to map scatterlist\n"); > > > > > > mutex_unlock(lock); > > > > > > > > >
Hi Lucas, As you mentioned there hasn't been any further review comments, so it is fair not to postpone this. However, I have one new review comment myself that I would like to see addressed in a v2 before I merge this for 5.3: On 2/28/19 8:19 AM, Lucas Stach wrote: > This is rougly equivalent to ca0e68e21aae (drm/prime: skip CPU sync > in map/unmap dma_buf). The contig memory allocated is already device > coherent memory, so there is no point in doing a CPU sync when > mapping it to another device. Also most importers currently cache > the mapping so the CPU sync would only happen on the first import. > With that in mind we are better off with not pretending to do a > cache synchronization at all. > > This gets rid of a lot of CPU overhead in uses where those dma-bufs > are regularily imported and detached again, like Weston is currently > doing in the DRM compositor. > > Signed-off-by: Lucas Stach <l.stach@pengutronix.de> > --- > drivers/media/common/videobuf2/videobuf2-dma-contig.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > index aff0ab7bf83d..d38f097c14ae 100644 > --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c > +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > @@ -273,8 +273,8 @@ static void vb2_dc_dmabuf_ops_detach(struct dma_buf *dbuf, > > /* release the scatterlist cache */ > if (attach->dma_dir != DMA_NONE) > - dma_unmap_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > - attach->dma_dir); > + dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > + attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC); Please add comments here... > sg_free_table(sgt); > kfree(attach); > db_attach->priv = NULL; > @@ -305,8 +305,8 @@ static struct sg_table *vb2_dc_dmabuf_ops_map( > } > > /* mapping to the client with new direction */ > - sgt->nents = dma_map_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > - dma_dir); > + sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > + dma_dir, DMA_ATTR_SKIP_CPU_SYNC); ... and here to explain why you can skip the cpu sync. The comment in ops_detach can refer to the comment in ops_map, so there is no need to give the full explanation in two places. It is not obvious that you can skip the CPU sync, so an explanation is helpful. Regards, Hans > if (!sgt->nents) { > pr_err("failed to map scatterlist\n"); > mutex_unlock(lock); >
Hi Hans, Lucas, On Fri, May 3, 2019 at 9:38 PM Hans Verkuil <hverkuil@xs4all.nl> wrote: > > Hi Lucas, > > As you mentioned there hasn't been any further review comments, so > it is fair not to postpone this. > Sorry for being late to the party. I didn't notice this patch before. (Perhaps it could be worth adding me as a reviewer to the MAINTAINERS entries for these parts of the media subsystem?) > However, I have one new review comment myself that I would like to > see addressed in a v2 before I merge this for 5.3: > > On 2/28/19 8:19 AM, Lucas Stach wrote: > > This is rougly equivalent to ca0e68e21aae (drm/prime: skip CPU sync > > in map/unmap dma_buf). The contig memory allocated is already device > > coherent memory, so there is no point in doing a CPU sync when > > mapping it to another device. Also most importers currently cache > > the mapping so the CPU sync would only happen on the first import. > > With that in mind we are better off with not pretending to do a > > cache synchronization at all. > > > > This gets rid of a lot of CPU overhead in uses where those dma-bufs > > are regularily imported and detached again, like Weston is currently > > doing in the DRM compositor. > > > > Signed-off-by: Lucas Stach <l.stach@pengutronix.de> > > --- > > drivers/media/common/videobuf2/videobuf2-dma-contig.c | 8 ++++---- > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > index aff0ab7bf83d..d38f097c14ae 100644 > > --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c > > @@ -273,8 +273,8 @@ static void vb2_dc_dmabuf_ops_detach(struct dma_buf *dbuf, > > > > /* release the scatterlist cache */ > > if (attach->dma_dir != DMA_NONE) > > - dma_unmap_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > > - attach->dma_dir); > > + dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > > + attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > > Please add comments here... > > > sg_free_table(sgt); > > kfree(attach); > > db_attach->priv = NULL; > > @@ -305,8 +305,8 @@ static struct sg_table *vb2_dc_dmabuf_ops_map( > > } > > > > /* mapping to the client with new direction */ > > - sgt->nents = dma_map_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, > > - dma_dir); > > + sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, > > + dma_dir, DMA_ATTR_SKIP_CPU_SYNC); > > ... and here to explain why you can skip the cpu sync. The comment in ops_detach > can refer to the comment in ops_map, so there is no need to give the full > explanation in two places. > > It is not obvious that you can skip the CPU sync, so an explanation is helpful. The change makes sense indeed, thanks. With Hans' suggestion addressed: Reviewed-by: Tomasz Figa <tfiga@chromium.org> In the bigger perspective, the DMA-buf and DMA API "map" semantics are kind of confusing themselves, because they imply "sync" by default. I wonder what it would take to just completely split "sync" from "map" in both APIs. It's even worse with DMA-buf, because you cannot ask dma_buf_map_attachment() not to skip the sync for you, nor you cannot sync again without unmapping and mapping again... Best regards, Tomasz
On 6/7/19 6:32 AM, Tomasz Figa wrote: > Hi Hans, Lucas, > > On Fri, May 3, 2019 at 9:38 PM Hans Verkuil <hverkuil@xs4all.nl> wrote: >> >> Hi Lucas, >> >> As you mentioned there hasn't been any further review comments, so >> it is fair not to postpone this. >> > > Sorry for being late to the party. I didn't notice this patch before. > (Perhaps it could be worth adding me as a reviewer to the MAINTAINERS > entries for these parts of the media subsystem?) Post a patch for that. I'd like to have more reviewers for patches in this area since it is not really my area of expertise. Regards, Hans
diff --git a/drivers/media/common/videobuf2/videobuf2-dma-contig.c b/drivers/media/common/videobuf2/videobuf2-dma-contig.c index aff0ab7bf83d..d38f097c14ae 100644 --- a/drivers/media/common/videobuf2/videobuf2-dma-contig.c +++ b/drivers/media/common/videobuf2/videobuf2-dma-contig.c @@ -273,8 +273,8 @@ static void vb2_dc_dmabuf_ops_detach(struct dma_buf *dbuf, /* release the scatterlist cache */ if (attach->dma_dir != DMA_NONE) - dma_unmap_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, - attach->dma_dir); + dma_unmap_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, + attach->dma_dir, DMA_ATTR_SKIP_CPU_SYNC); sg_free_table(sgt); kfree(attach); db_attach->priv = NULL; @@ -305,8 +305,8 @@ static struct sg_table *vb2_dc_dmabuf_ops_map( } /* mapping to the client with new direction */ - sgt->nents = dma_map_sg(db_attach->dev, sgt->sgl, sgt->orig_nents, - dma_dir); + sgt->nents = dma_map_sg_attrs(db_attach->dev, sgt->sgl, sgt->orig_nents, + dma_dir, DMA_ATTR_SKIP_CPU_SYNC); if (!sgt->nents) { pr_err("failed to map scatterlist\n"); mutex_unlock(lock);
This is rougly equivalent to ca0e68e21aae (drm/prime: skip CPU sync in map/unmap dma_buf). The contig memory allocated is already device coherent memory, so there is no point in doing a CPU sync when mapping it to another device. Also most importers currently cache the mapping so the CPU sync would only happen on the first import. With that in mind we are better off with not pretending to do a cache synchronization at all. This gets rid of a lot of CPU overhead in uses where those dma-bufs are regularily imported and detached again, like Weston is currently doing in the DRM compositor. Signed-off-by: Lucas Stach <l.stach@pengutronix.de> --- drivers/media/common/videobuf2/videobuf2-dma-contig.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)