Message ID: 20210108105241.1757799-1-misono.tomohiro@jp.fujitsu.com
Series: Add Fujitsu A64FX soc entry/hardware barrier driver
On Fri, Jan 08, 2021 at 07:52:31PM +0900, Misono Tomohiro wrote:
> (Resend as cover letter title was missing in the first time. Sorry for noise)
>
> Hello,

Hi,

> This series adds Fujitsu A64FX SoC entry in drivers/soc and hardware
> barrier driver for it.
>
> [Driver Description]
> A64FX CPU has several functions for HPC workload and hardware barrier
> is one of them. It is a mechanism to realize fast synchronization by
> PEs belonging to the same L3 cache domain by using implementation
> defined hardware registers.
> For more details, see A64FX HPC extension specification in
> https://github.com/fujitsu/A64FX
>
> The driver mainly offers a set of ioctls to manipulate related registers.
> Patch 1-9 implements driver code and patch 10 finally adds kconfig,
> Makefile and MAINTAINER entry for the driver.

I have a number of concerns here, and at a high level, I do not think
that this is something Linux can reasonably support in its current form.
Sorry if this comes across as harsh; I appreciate the work that has gone
into this, and the effort to try to upstream support is great -- my
concerns are with the overall picture.

As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
in Linux, as they pose a number of correctness/safety challenges and
come with a potentially significant long-term maintenance burden that is
generally not justified by the features themselves. For example, such
features are not usable under virtualization (where a hypervisor may set
HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).

Secondly, the intended usage model appears to expose this to EL0 for
direct access, and the code seems to depend on threads being pinned, but
AFAICT this is not enforced and there is no provision for
context-switch, thread migration, or interaction with ptrace. I fear
this is going to be very fragile in practice, and that extending that
support in future will require much more complexity than is currently
apparent, with potentially invasive changes to arch code.

Thirdly, this requires userspace software to be intimately familiar with
the HW platform that it is running on (both in terms of using IMP-DEF
instructions and needing to know the physical layout), rather than being
generic and portable, which I don't believe is something that we wish to
encourage. I also think this is unlikely to be supported by generic
software because of the lack of portability, and consequently I struggle
to believe that this will see significant usage.

Further, as an IMP-DEF feature, it's not clear how much of this will
carry forward into future designs, and where things may change. It's
extremely difficult to determine whether any of the ABI decisions (e.g.
the sysfs layout) are sufficient, or what level of changes would be
necessary in userspace code if there are physical topology changes or
changes to the structure of the system register interfaces.

Overall, I think this needs much more justification in terms of
practical usage, safety/correctness, and long-term maintenance, and with
that I think the longer-term goal would be to use this to justify an
architectural feature along similar lines rather than to support any
IMPLEMENTATION DEFINED variants upstream in Linux.

> Also, C library and test program for this driver is available on:
> https://github.com/fujitsu/hardware_barrier

Hmm... I see some code in that repo which looks suspiciously like code
from the Linux kernel tree, but licensed differently, which is
concerning.
Specifically, the asm block in internal.h (which the SPDX header says
is licensed as LGPL-3.0-only) looks like a copy of code from
arch/arm64/include/asm/sysreg.h (which is licensed as GPL-2.0-only per
its SPDX header).

If that code was copied, I don't believe that relicensing is permitted.
I would advise that someone with legal training consider the provenance
of that code and what is permitted.

Thanks,
Mark.

> The driver is based on v5.11-rc2 and tested on FX700 environment.
>
> [RFC]
> This is the first time we upstream drivers for our chip and I want to
> confirm driver location and patch submission process.
>
> Based on my observation it seems drivers/soc folder is right place to put
> this driver, so I added Kconfig entry for arm64 platform config, created
> soc/fujitsu folder and updated MAINTAINER entry accordingly (last patch).
> Is it right?
>
> Also for final submission I think I need to 1) create some public git
> tree to push driver code (github or something), 2) make pull request to
> SOC team (soc@kernel.org). Is it a correct procedure?
>
> I will appreciate any help/comments.
>
> sidenote: We plan to post other drivers for A64FX HPC extension
> (prefetch control and cache control) too anytime soon.
>
> Misono Tomohiro (10):
>   soc: fujitsu: hwb: Add hardware barrier driver init/exit code
>   soc: fujitsu: hwb: Add open operation
>   soc: fujitsu: hwb: Add IOC_BB_ALLOC ioctl
>   soc: fujitsu: hwb: Add IOC_BW_ASSIGN ioctl
>   soc: fujitsu: hwb: Add IOC_BW_UNASSIGN ioctl
>   soc: fujitsu: hwb: Add IOC_BB_FREE ioctl
>   soc: fujitsu: hwb: Add IOC_GET_PE_INFO ioctl
>   soc: fujitsu: hwb: Add release operation
>   soc: fujitsu: hwb: Add sysfs entry
>   soc: fujitsu: hwb: Add Kconfig/Makefile to build fujitsu_hwb driver
>
>  MAINTAINERS                            |    7 +
>  arch/arm64/Kconfig.platforms           |    5 +
>  drivers/soc/Kconfig                    |    1 +
>  drivers/soc/Makefile                   |    1 +
>  drivers/soc/fujitsu/Kconfig            |   24 +
>  drivers/soc/fujitsu/Makefile           |    2 +
>  drivers/soc/fujitsu/fujitsu_hwb.c      | 1253 ++++++++++++++++++++++++
>  include/uapi/linux/fujitsu_hpc_ioctl.h |   41 +
>  8 files changed, 1334 insertions(+)
>  create mode 100644 drivers/soc/fujitsu/Kconfig
>  create mode 100644 drivers/soc/fujitsu/Makefile
>  create mode 100644 drivers/soc/fujitsu/fujitsu_hwb.c
>  create mode 100644 include/uapi/linux/fujitsu_hpc_ioctl.h
>
> --
> 2.26.2
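For orientation, the ioctl surface named in the patch titles above would
look roughly like the following in the uapi header. This is a sketch
only: the struct layout, magic number, and request numbers are guesses,
not the actual contents of include/uapi/linux/fujitsu_hpc_ioctl.h.

/* Sketch of the uapi shape implied by the patch titles; layout and
 * numbers are illustrative guesses only. */
#include <linux/ioctl.h>
#include <linux/types.h>

struct fujitsu_hwb_ioc_bb_ctl {
        __u8  bb;          /* barrier blade index within the L3 domain */
        __u8  unused;
        __u16 size;        /* size of the PE mask in bytes */
        __u32 pemask[8];   /* bitmap of participating PEs */
};

#define FUJITSU_HWB_IOC_MAGIC 'F'

#define FUJITSU_HWB_IOC_BB_ALLOC \
        _IOWR(FUJITSU_HWB_IOC_MAGIC, 0x00, struct fujitsu_hwb_ioc_bb_ctl)
#define FUJITSU_HWB_IOC_BB_FREE \
        _IOW(FUJITSU_HWB_IOC_MAGIC, 0x01, struct fujitsu_hwb_ioc_bb_ctl)
#define FUJITSU_HWB_IOC_BW_ASSIGN \
        _IOW(FUJITSU_HWB_IOC_MAGIC, 0x02, struct fujitsu_hwb_ioc_bb_ctl)
#define FUJITSU_HWB_IOC_BW_UNASSIGN \
        _IOW(FUJITSU_HWB_IOC_MAGIC, 0x03, struct fujitsu_hwb_ioc_bb_ctl)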
On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
> On Fri, Jan 08, 2021 at 07:52:31PM +0900, Misono Tomohiro wrote:
> > This series adds Fujitsu A64FX SoC entry in drivers/soc and hardware
> > barrier driver for it.

[...]

> I have a number of concerns here, and at a high level, I do not think
> that this is something Linux can reasonably support in its current form.
> Sorry if this comes across as harsh; I appreciate the work that has gone
> into this, and the effort to try to upstream support is great -- my
> concerns are with the overall picture.
>
> As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
> in Linux, as they pose a number of correctness/safety challenges and
> come with a potentially significant long-term maintenance burden that is
> generally not justified by the features themselves. For example, such
> features are not usable under virtualization (where a hypervisor may set
> HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).

I am somewhat less concerned about the feature being implementation
defined than I am about adding a custom user interface for one
platform.

In the end, anything outside of the CPU core that ends up in a SoC
is implementation defined, and this is usually not a problem as long
as we have an abstraction in the kernel that hides the details from
the user, and the system is still functional if the implementation is
turned off for whatever reason.

> Secondly, the intended usage model appears to expose this to EL0 for
> direct access, and the code seems to depend on threads being pinned, but
> AFAICT this is not enforced and there is no provision for
> context-switch, thread migration, or interaction with ptrace. I fear
> this is going to be very fragile in practice, and that extending that
> support in future will require much more complexity than is currently
> apparent, with potentially invasive changes to arch code.

Right, this is the main problem I see, too. I had not even realized
that this will have to tie in with user space threads in some form, but
you are right that once this has to interact with the CPU scheduler,
it all breaks down.

One way I can imagine this working out is to tie into the cpuset
mechanism that is used for isolating threads to CPU cores, and
then provide a cpuset interface that has the desired behavior
but that can fall back to a generic implementation with the same
or stronger (but normally slower) semantics.

> Thirdly, this requires userspace software to be intimately familiar with
> the HW platform that it is running on (both in terms of using IMP-DEF
> instructions and needing to know the physical layout), rather than being
> generic and portable, which I don't believe is something that we wish to
> encourage. I also think this is unlikely to be supported by generic
> software because of the lack of portability, and consequently I struggle
> to believe that this will see significant usage.

Agreed as well.

       Arnd
On Fri, Jan 08, 2021 at 03:23:23PM +0100, Arnd Bergmann wrote:
> On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
> > in Linux, as they pose a number of correctness/safety challenges and
> > come with a potentially significant long-term maintenance burden that is
> > generally not justified by the features themselves. For example, such
> > features are not usable under virtualization (where a hypervisor may set
> > HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).
>
> I am somewhat less concerned about the feature being implementation
> defined than I am about adding a custom user interface for one
> platform.

I completely agree that adding a custom interface that's platform
specific is undesirable.

> In the end, anything outside of the CPU core that ends up in a SoC
> is implementation defined, and this is usually not a problem as long
> as we have an abstraction in the kernel that hides the details from
> the user, and the system is still functional if the implementation is
> turned off for whatever reason.

I think that peripherals and other bits out in the SoC are quite
different to things built into the CPU, where there are inevitably more
significant and subtle interactions with the architecture, and which
can be so closely coupled as to leave no good point at which to apply
an abstraction. We have common ways of abstracting storage devices, but
the same is not as true for userspace instructions.

Thanks,
Mark.
Hi,

First of all, thanks a lot for all the comments to both of you (cont.
below).

> On Fri, Jan 8, 2021 at 1:54 PM Mark Rutland <mark.rutland@arm.com> wrote:

[...]

> > As a general rule, we avoid the use of IMPLEMENTATION DEFINED features
> > in Linux, as they pose a number of correctness/safety challenges and
> > come with a potentially significant long-term maintenance burden that is
> > generally not justified by the features themselves. For example, such
> > features are not usable under virtualization (where a hypervisor may set
> > HCR_EL2.TIDCP, or fail to context-switch state that it is unaware of).
>
> I am somewhat less concerned about the feature being implementation
> defined than I am about adding a custom user interface for one
> platform.
>
> In the end, anything outside of the CPU core that ends up in a SoC
> is implementation defined, and this is usually not a problem as long
> as we have an abstraction in the kernel that hides the details from
> the user, and the system is still functional if the implementation is
> turned off for whatever reason.

Understood. However, I don't know of any other processors having
similar features at this point, and it is hard to provide a common
abstraction interface. I would appreciate it if anyone has any
information.

> > Secondly, the intended usage model appears to expose this to EL0 for
> > direct access, and the code seems to depend on threads being pinned, but
> > AFAICT this is not enforced and there is no provision for
> > context-switch, thread migration, or interaction with ptrace. I fear
> > this is going to be very fragile in practice, and that extending that
> > support in future will require much more complexity than is currently
> > apparent, with potentially invasive changes to arch code.
>
> Right, this is the main problem I see, too. I had not even realized
> that this will have to tie in with user space threads in some form, but
> you are right that once this has to interact with the CPU scheduler,
> it all breaks down.

This observation is right. I thought that adding context-switch etc.
support for implementation-defined registers would require core arch
code changes, which would be far less acceptable. So I tried to confine
the code changes to a module, with these restrictions.

Regarding direct access from EL0, it is necessary for realizing fast
synchronization, as it enables the synchronization logic in the user
application to check whether all threads have reached the
synchronization point without switching to the kernel. Also, it is
common usage in multi-threaded HPC applications for each running thread
to be bound to one PE.

> One way I can imagine this working out is to tie into the cpuset
> mechanism that is used for isolating threads to CPU cores, and
> then provide a cpuset interface that has the desired behavior
> but that can fall back to a generic implementation with the same
> or stronger (but normally slower) semantics.

I'm not sure if this approach is feasible, but I will try to look into
it.

> > Thirdly, this requires userspace software to be intimately familiar with
> > the HW platform that it is running on (both in terms of using IMP-DEF
> > instructions and needing to know the physical layout), rather than being
> > generic and portable, which I don't believe is something that we wish to
> > encourage. I also think this is unlikely to be supported by generic
> > software because of the lack of portability, and consequently I struggle
> > to believe that this will see significant usage.
>
> Agreed as well.

It may be possible to trap accesses to these implementation-defined
registers and fall back to some logic in the driver. The problem is
that other processors might use the same IMP-DEF registers for a
different purpose.

Regards,
Tomohiro
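For concreteness, the EL0 fast path described above amounts to spinning
on an IMP-DEF system register with no kernel entry on the
synchronization path. A rough sketch, assuming GCC/Clang on aarch64;
the register name below is a placeholder encoding, not the real A64FX
barrier register (the actual encodings are in the A64FX HPC extension
specification):

/* Sketch only: S3_3_C15_C15_0 is a placeholder IMP-DEF encoding, not
 * the real A64FX barrier register. */
#include <stdint.h>

static inline uint64_t read_barrier_state(void)
{
        uint64_t v;

        /* mrs on an IMP-DEF system register; no kernel entry */
        asm volatile("mrs %0, S3_3_C15_C15_0" : "=r"(v));
        return v;
}

static inline void hwb_wait(uint64_t expected)
{
        /* Spin until the hardware reports that all PEs in the barrier
         * window have arrived. */
        while (read_barrier_state() != expected)
                asm volatile("yield");
}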
> On Fri, Jan 08, 2021 at 07:52:31PM +0900, Misono Tomohiro wrote:

[...]

> > Also, C library and test program for this driver is available on:
> > https://github.com/fujitsu/hardware_barrier
>
> Hmm... I see some code in that repo which looks suspiciously like code
> from the Linux kernel tree, but licensed differently, which is
> concerning.
>
> Specifically, the asm block in internal.h (which the SPDX header says
> is licensed as LGPL-3.0-only) looks like a copy of code from
> arch/arm64/include/asm/sysreg.h (which is licensed as GPL-2.0-only per
> its SPDX header).
>
> If that code was copied, I don't believe that relicensing is permitted.
> I would advise that someone with legal training consider the provenance
> of that code and what is permitted.

Sorry, I must not have paid attention to where the code came from when
I wrote it. I have removed that part and now write the assembly
directly:
https://github.com/fujitsu/hardware_barrier/blob/develop/src/internal.h
https://github.com/fujitsu/hardware_barrier/blob/develop/src/hwblib.c#L215

Regards,
Tomohiro
On Tue, Jan 12, 2021 at 11:24 AM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
> However, I don't know of any other processors having similar features
> at this point, and it is hard to provide a common abstraction
> interface. I would appreciate it if anyone has any information.

The specification you pointed to mentions the SPARC64 XIfx, so
at a minimum, a user interface should be designed to also work on
whatever register-level interface that provides.

> > > Secondly, the intended usage model appears to expose this to EL0 for
> > > direct access, and the code seems to depend on threads being pinned, but
> > > AFAICT this is not enforced and there is no provision for
> > > context-switch, thread migration, or interaction with ptrace. [...]
> >
> > Right, this is the main problem I see, too. [...]
>
> This observation is right. I thought that adding context-switch etc.
> support for implementation-defined registers would require core arch
> code changes, which would be far less acceptable. So I tried to confine
> the code changes to a module, with these restrictions.

My feeling is that keeping the code separate from where it would belong
in an operating system designed specifically for this feature ends up
being no better than rewriting the core scheduling code.

As Mark said, it may well be that neither approach would be sufficient
for an upstream merge. On the other hand, keeping the code in a
separate loadable module does make the most sense if we end up
not merging it at all, in which case this is the easiest to port
between kernel versions.

> Regarding direct access from EL0, it is necessary for realizing fast
> synchronization, as it enables the synchronization logic in the user
> application to check whether all threads have reached the
> synchronization point without switching to the kernel.

Ok, I see.

> Also, it is common usage in multi-threaded HPC applications for each
> running thread to be bound to one PE.

I think the expectation that all threads are bound to a physical CPU
makes sense for using this feature, but I think it would be necessary
to enforce that, e.g. by allowing threads to enable it only after they
are isolated to a non-shared CPU, and automatically disabling it
if the CPU isolation is changed.

For the user space interface, something based on process IDs
seems to make more sense to me than something based on CPU
numbers. All of the above does require some level of integration
with the core kernel of course.

I think the next step would be to try to come up with a high-level
user interface design that has a chance to get merged, rather than
addressing the review comments for the current implementation.

Aside from the user interface question, it would be good to
understand the performance impact of the feature.
As I understand it, the entire purpose is to make things faster, so
to put it in perspective compared to the burden of adding an
interface, there should be some numbers: What are the kinds of
applications that would use it in practice, and how much faster are
they compared to not having it?

       Arnd
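The enforcement Arnd suggests would pair with a user-space pattern
along the following lines. A sketch only: the ioctl name is taken from
the patch titles and its definition here is a placeholder; whether the
kernel should verify the affinity at assign time is exactly the open
question.

/* Sketch: pin the calling thread to one PE before requesting a
 * barrier window. The request definition below is a placeholder; the
 * real one lives in the series' uapi header. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#ifndef FUJITSU_HWB_IOC_BW_ASSIGN
#define FUJITSU_HWB_IOC_BW_ASSIGN _IOW('F', 0x02, int) /* placeholder */
#endif

static int bind_and_assign(int fd, int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
                return -1;

        /* Nothing stops the affinity from being widened again later,
         * which is the fragility Mark and Arnd point out. */
        return ioctl(fd, FUJITSU_HWB_IOC_BW_ASSIGN, cpu);
}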
> On Tue, Jan 12, 2021 at 11:24 AM misono.tomohiro@fujitsu.com
> <misono.tomohiro@fujitsu.com> wrote:
> > However, I don't know of any other processors having similar features
> > at this point, and it is hard to provide a common abstraction
> > interface. I would appreciate it if anyone has any information.
>
> The specification you pointed to mentions the SPARC64 XIfx, so
> at a minimum, a user interface should be designed to also work on
> whatever register-level interface that provides.

Those previous CPUs of ours have a hardware barrier function too, but
it is not currently used (I believe the hardware design shares a common
idea, and this driver logic/ioctl interface could be applicable to
both).

[...]

> I think the expectation that all threads are bound to a physical CPU
> makes sense for using this feature, but I think it would be necessary
> to enforce that, e.g. by allowing threads to enable it only after they
> are isolated to a non-shared CPU, and automatically disabling it
> if the CPU isolation is changed.
>
> For the user space interface, something based on process IDs
> seems to make more sense to me than something based on CPU
> numbers. All of the above does require some level of integration
> with the core kernel of course.
>
> I think the next step would be to try to come up with a high-level
> user interface design that has a chance to get merged, rather than
> addressing the review comments for the current implementation.

Understood. One question: while a high-level interface such as
process-based control could solve several problems (i.e. access
control/forced binding), I cannot eliminate access to the IMP-DEF
registers from EL0, as I explained above. Is that acceptable in your
view?

> Aside from the user interface question, it would be good to
> understand the performance impact of the feature.
> As I understand it, the entire purpose is to make things faster, so
> to put it in perspective compared to the burden of adding an
> interface, there should be some numbers: What are the kinds of
> applications that would use it in practice, and how much faster are
> they compared to not having it?

A microbenchmark shows it takes around 250ns for one synchronization of
12 PEs with the hardware barrier, which is multiple times faster than a
software barrier (measuring only the core synchronization logic and
excluding setup time). I don't have application results at this point
and will share them when I can get some.

Regards,
Tomohiro
On Fri, Jan 15, 2021 at 12:10 PM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
>
> Understood. One question: while a high-level interface such as
> process-based control could solve several problems (i.e. access
> control/forced binding), I cannot eliminate access to the IMP-DEF
> registers from EL0, as I explained above. Is that acceptable in your
> view?

I think you will get different answers for that depending on who you ask ;-)

I'm generally ok with it, given that it will only affect a very small
number of specialized applications that are already built for
a specific microarchitecture for performance reasons. E.g. when
using an arm64 BLAS library, you would use different versions
of the same functions depending on CPU support for NEON,
SVE, SVE2, Apple AMX (which also uses imp-def instructions),
ARMv8.6 GEMM extensions, and likely a hand-optimized
version for the A64FX pipeline. Having a version for A64FX with
hardware barriers adds (at most) one more code path but hopefully
does not add complexity to the common code.

> > Aside from the user interface question, it would be good to
> > understand the performance impact of the feature.
> > As I understand it, the entire purpose is to make things faster, so
> > to put it in perspective compared to the burden of adding an
> > interface, there should be some numbers: What are the kinds of
> > applications that would use it in practice, and how much faster are
> > they compared to not having it?
>
> A microbenchmark shows it takes around 250ns for one synchronization of
> 12 PEs with the hardware barrier, which is multiple times faster than a
> software barrier (measuring only the core synchronization logic and
> excluding setup time). I don't have application results at this point
> and will share them when I can get some.

Thanks. That will be helpful indeed. Please also include information
about what you are comparing against for the software barrier. E.g.
is that based on a futex() system call, or completely implemented
in user space?

       Arnd
> > Understood. One question: while a high-level interface such as
> > process-based control could solve several problems (i.e. access
> > control/forced binding), I cannot eliminate access to the IMP-DEF
> > registers from EL0, as I explained above. Is that acceptable in your
> > view?
>
> I think you will get different answers for that depending on who you ask ;-)
>
> I'm generally ok with it, given that it will only affect a very small
> number of specialized applications that are already built for
> a specific microarchitecture for performance reasons. E.g. when
> using an arm64 BLAS library, you would use different versions
> of the same functions depending on CPU support for NEON,
> SVE, SVE2, Apple AMX (which also uses imp-def instructions),
> ARMv8.6 GEMM extensions, and likely a hand-optimized
> version for the A64FX pipeline. Having a version for A64FX with
> hardware barriers adds (at most) one more code path but hopefully
> does not add complexity to the common code.

Thanks. Btw, to be precise, A64FX doesn't use imp-def instructions. It
provides imp-def registers, which can be accessed with the ordinary
system register access instructions (msr/mrs).

> > A microbenchmark shows it takes around 250ns for one synchronization of
> > 12 PEs with the hardware barrier, which is multiple times faster than a
> > software barrier (measuring only the core synchronization logic and
> > excluding setup time). I don't have application results at this point
> > and will share them when I can get some.
>
> Thanks. That will be helpful indeed. Please also include information
> about what you are comparing against for the software barrier. E.g.
> is that based on a futex() system call, or completely implemented
> in user space?

It is completely implemented in user space, using shared variables and
no system calls. (As all PEs to be synced share the L3, the
synchronization traffic should stay within the L3.)

Regards,
Tomohiro
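For reference, the user-space software barrier being compared against
is presumably something like a centralized sense-reversing barrier over
shared variables. A generic sketch using C11 atomics, not Fujitsu's
actual implementation:

/* Generic sense-reversing barrier, entirely in user space: all
 * traffic is loads/stores on shared (L3-resident) variables, with no
 * system calls. */
#include <stdatomic.h>

struct sw_barrier {
        atomic_int count;   /* initialized to nthreads */
        atomic_int sense;   /* initialized to 0 */
        int nthreads;
};

/* Each thread keeps its own local_sense, initialized to 0. */
static void sw_barrier_wait(struct sw_barrier *b, int *local_sense)
{
        *local_sense = !*local_sense;

        if (atomic_fetch_sub(&b->count, 1) == 1) {
                /* Last arrival: re-arm the counter, then release
                 * everyone by publishing the new sense. */
                atomic_store(&b->count, b->nthreads);
                atomic_store(&b->sense, *local_sense);
        } else {
                while (atomic_load(&b->sense) != *local_sense)
                        ; /* spin until the last thread flips sense */
        }
}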
> > > > I think the next step would be to try to come up with a high-level
> > > > user interface design that has a chance to get merged, rather than
> > > > addressing the review comments for the current implementation.

Hello,

Sorry for the late response, but while thinking about new approaches, I
came up with a different idea and want to hear your opinions. How about
offloading all control to user space, while the driver just offers
read/write access to the needed registers? Let me explain in detail.

Although I searched for similar functions in other products, I could
not find any. Also, this hardware barrier performs intra-NUMA
synchronization and is hard to use for a general inter-process barrier.
So I think generalizing this feature in the kernel will not go well.

As I said, this is mainly for HPC applications. In the usual situation,
the user has full control of the compute nodes when running an HPC
application, and thus the user has full responsibility for the
processes running on the machine. Offloading all control of these
registers to the user is acceptable in that case (i.e. the driver just
offers access to the registers and does not control them). This is safe
for kernel operation, as manipulating the barrier-related registers
only affects the user application.

In this approach we could remove the ioctls and control logic from the
driver, but we still need some way to access the needed registers. I
first considered an approach like x86's MSR driver, but I know that
idea was recently rejected for security concerns:
https://lore.kernel.org/linux-arm-kernel/20201130174833.41315-1-rongwei.wang@linux.alibaba.com/

Based on these observations, I currently have two ideas:
1) make the driver expose only a sysfs interface for reading/writing
   A64FX's barrier registers, or
2) generalize (1) in some way: make some mechanism to expose
   CPU-defined registers which can be safely accessed from user space.

Are these ideas acceptable ways to explore to get merged upstream?
I'd appreciate any criticism/comments.

Regards,
Tomohiro
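On the kernel side, idea (1) would reduce to something like the
following minimal sketch; the attribute name and sys_reg() encoding are
placeholders, not taken from the actual driver. It also illustrates the
inherent problem: the access runs on whichever CPU happens to handle
the sysfs I/O, while the registers are per-PE state.

/* Sketch of idea (1): expose one barrier register through sysfs.
 * HWB_REG_PLACEHOLDER is not the real A64FX encoding. */
#include <linux/device.h>
#include <linux/kernel.h>
#include <asm/sysreg.h>

#define HWB_REG_PLACEHOLDER sys_reg(3, 3, 15, 15, 0)

static ssize_t bst_show(struct device *dev,
                        struct device_attribute *attr, char *buf)
{
        /* Runs on whichever CPU services the read, not a chosen PE */
        return sysfs_emit(buf, "%llx\n",
                          read_sysreg_s(HWB_REG_PLACEHOLDER));
}

static ssize_t bst_store(struct device *dev,
                         struct device_attribute *attr,
                         const char *buf, size_t count)
{
        u64 val;

        if (kstrtou64(buf, 16, &val))
                return -EINVAL;
        write_sysreg_s(val, HWB_REG_PLACEHOLDER);
        return count;
}
static DEVICE_ATTR_RW(bst);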
Hi,

Gentle ping?

Tomohiro

> Subject: RE: [RFC PATCH 00/10] Add Fujitsu A64FX soc entry/hardware barrier driver

[...]
On Thu, Feb 18, 2021 at 10:49 AM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
>
> Although I searched for similar functions in other products, I could
> not find any. Also, this hardware barrier performs intra-NUMA
> synchronization and is hard to use for a general inter-process barrier.
> So I think generalizing this feature in the kernel will not go well.

Ok, thank you for taking a look.

> As I said, this is mainly for HPC applications. In the usual situation,
> the user has full control of the compute nodes when running an HPC
> application, and thus the user has full responsibility for the
> processes running on the machine. Offloading all control of these
> registers to the user is acceptable in that case (i.e. the driver just
> offers access to the registers and does not control them). This is safe
> for kernel operation, as manipulating the barrier-related registers
> only affects the user application.
>
> In this approach we could remove the ioctls and control logic from the
> driver, but we still need some way to access the needed registers. I
> first considered an approach like x86's MSR driver, but I know that
> idea was recently rejected for security concerns:
> https://lore.kernel.org/linux-arm-kernel/20201130174833.41315-1-rongwei.wang@linux.alibaba.com/
>
> Based on these observations, I currently have two ideas:
> 1) make the driver expose only a sysfs interface for reading/writing
>    A64FX's barrier registers, or
> 2) generalize (1) in some way: make some mechanism to expose
>    CPU-defined registers which can be safely accessed from user space.
>
> Are these ideas acceptable ways to explore to get merged upstream?
> I'd appreciate any criticism/comments.

I'm also running out of ideas here. I don't think a sysfs interface
would be any different from your earlier ioctl interface or the
/dev/msr approach; they all share the same problem that they expose
low-level access to platform-specific registers in a way that is
neither portable nor safe to use for general-purpose applications
outside the very narrow scope of running highly optimized HPC
applications.

You can of course continue using the module you have as an external
module that gets built outside of the kernel itself, and shipped along
with the application or library using it, rather than with the kernel.
Obviously this is not something I would generally recommend either,
but this may be the last resort to fall back to when everything else
fails.

       Arnd
> On Thu, Feb 18, 2021 at 10:49 AM misono.tomohiro@fujitsu.com
> <misono.tomohiro@fujitsu.com> wrote:

[...]

> I'm also running out of ideas here. I don't think a sysfs interface
> would be any different from your earlier ioctl interface or the
> /dev/msr approach; they all share the same problem that they expose
> low-level access to platform-specific registers in a way that is
> neither portable nor safe to use for general-purpose applications
> outside the very narrow scope of running highly optimized HPC
> applications.

Ok, but the ARM architecture permits implementation-defined registers
in the first place. So can we provide some method/interface to access
them as a CPU feature, as long as these registers at least do not
affect kernel operations (like this barrier) and only root can access
them? A library could offer a portable way for user applications
(under root permission) to access them.

> You can of course continue using the module you have as an external
> module that gets built outside of the kernel itself, and shipped along
> with the application or library using it, rather than with the kernel.
> Obviously this is not something I would generally recommend either,
> but this may be the last resort to fall back to when everything else
> fails.

Understood. This is the last resort. Thanks very much for your help.

Regards,
Tomohiro
On Wed, Mar 3, 2021 at 12:20 PM misono.tomohiro@fujitsu.com
<misono.tomohiro@fujitsu.com> wrote:
> Ok, but the ARM architecture permits implementation-defined registers
> in the first place. So can we provide some method/interface to access
> them as a CPU feature, as long as these registers at least do not
> affect kernel operations (like this barrier) and only root can access
> them? A library could offer a portable way for user applications
> (under root permission) to access them.

The kernel is meant to provide an abstraction for any differences
between the CPUs, including implementation-defined registers. While any
such abstraction will be leaky, just passing through the raw registers
is generally not a helpful abstraction at all, as seen from the x86 MSR
discussion you pointed to.

One problem with having a root-only register-level interface is that
this can break the boundary between kernel mode and root user space,
and this is something that a lot of people would like to strengthen for
security reasons (e.g. a root user should not be able to break secure
boot).

Another problem is that exposing the raw registers from kernel space
creates an ABI, and if it turns out to be a bad idea later on, this is
hard to take back without breaking existing applications. Not breaking
things that used to work is the primary rule for the Linux kernel.

In order to merge anything into the mainline kernel, I think the
requirement would be that it provides a sensible abstraction inside of
the kernel that can be used directly from applications without having
to go through another library that abstracts it, and that has a good
chance of being supportable forever.

       Arnd
> > > I'm also running out of ideas here. I don't think a sysfs interface
> > > would be any different from your earlier ioctl interface or the
> > > /dev/msr approach; they all share the same problem that they expose
> > > low-level access to platform-specific registers in a way that is
> > > neither portable nor safe to use for general-purpose applications
> > > outside the very narrow scope of running highly optimized HPC
> > > applications.
> >
> > Ok, but the ARM architecture permits implementation-defined registers
> > in the first place. So can we provide some method/interface to access
> > them as a CPU feature, as long as these registers at least do not
> > affect kernel operations (like this barrier) and only root can access
> > them? A library could offer a portable way for user applications
> > (under root permission) to access them.
>
> The kernel is meant to provide an abstraction for any differences
> between the CPUs, including implementation-defined registers. While any
> such abstraction will be leaky, just passing through the raw registers
> is generally not a helpful abstraction at all, as seen from the x86 MSR
> discussion you pointed to.
>
> One problem with having a root-only register-level interface is that
> this can break the boundary between kernel mode and root user space,
> and this is something that a lot of people would like to strengthen for
> security reasons (e.g. a root user should not be able to break secure
> boot).
>
> Another problem is that exposing the raw registers from kernel space
> creates an ABI, and if it turns out to be a bad idea later on, this is
> hard to take back without breaking existing applications. Not breaking
> things that used to work is the primary rule for the Linux kernel.

Ok, thanks for the thorough explanations. They help my understanding.

> In order to merge anything into the mainline kernel, I think the
> requirement would be that it provides a sensible abstraction inside of
> the kernel that can be used directly from applications without having
> to go through another library that abstracts it, and that has a good
> chance of being supportable forever.

As you mentioned the idea of a process-based approach earlier, I will
reconsider the possibility of a general abstraction interface along
those lines.

Regards,
Tomohiro