[RFC] tentative prctl task isolation interface

Message ID	20210113121544.GA16380@fuller.cnet (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=ytkm=GQ=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1A75B233CE Date: Wed, 13 Jan 2021 09:15:44 -0300 From: Marcelo Tosatti <mtosatti@redhat.com> To: Alex Belits <abelits@marvell.com> Cc: "tglx@linutronix.de" <tglx@linutronix.de>, "cl@linux.com" <cl@linux.com>, "pauld@redhat.com" <pauld@redhat.com>, "linux-mm@kvack.org" <linux-mm@kvack.org>, "frederic@kernel.org" <frederic@kernel.org>, "willy@infradead.org" <willy@infradead.org>, "peterz@infradead.org" <peterz@infradead.org>, "akpm@linux-foundation.org" <akpm@linux-foundation.org>, Juri Lelli <juri.lelli@redhat.com>, Daniel Bristot de Oliveira <bristot@redhat.com> Subject: [RFC] tentative prctl task isolation interface Message-ID: <20210113121544.GA16380@fuller.cnet> References: <20201117162805.GA274911@fuller.cnet> <20201117180356.GT29991@casper.infradead.org> <alpine.DEB.2.22.394.2011171855500.215602@www.lameter.com> <20201117202317.GA282679@fuller.cnet> <alpine.DEB.2.22.394.2011201817320.248402@www.lameter.com> <20201127154845.GA9100@fuller.cnet> <alpine.DEB.2.22.394.2011300927120.337729@www.lameter.com> <87h7p4dwus.fsf@nanos.tec.linutronix.de> <12ddb629555590cfd41db5b10854d95c1f154e24.camel@marvell.com> MIME-Version: 1.0 In-Reply-To: <12ddb629555590cfd41db5b10854d95c1f154e24.camel@marvell.com> User-Agent: Mutt/1.10.1 (2018-07-13) Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[RFC] tentative prctl task isolation interface \| expand [RFC] tentative prctl task isolation interface

Marcelo Tosatti Jan. 13, 2021, 12:15 p.m. UTC

Hi,

So as discussed, this is one possible prctl interface for
task isolation.

Is this something that is desired? If not, what is the
proper way for the interface to be?

(addition of a new capability CAP_TASK_ISOLATION, 
for permissions is still missing, should be done 
in the next versions).

Thanks.

add prctl interface for task isolation

Add a new extensible interface for task isolation,
and allow userspace to quiesce the CPU.

This means putting the system into a quiet state by 
completing all workqueue items, idle all subsystems
that need it and put the cpu into NOHZ mode.


Suggested-by: Christopher Lameter <cl@linux.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Christoph Lameter (Ampere) Jan. 14, 2021, 9:22 a.m. UTC | #1

On Wed, 13 Jan 2021, Marcelo Tosatti wrote:

> So as discussed, this is one possible prctl interface for
> task isolation.
>
> Is this something that is desired? If not, what is the
> proper way for the interface to be?

Sure that sounds liek a good beginning but I guess we need some
specificity on the features

> +Task isolation CPU interface
> +============================

How does one do a oneshot flush of OS activities?

I.e. I have a polling loop over numerous shared and I/o devices in user
space and I want to make sure that the system is quite before I enter the
loop. In the loop itself some activities may require syscalls so they will
potentialy cause the OS services such as timers to start again. When such
an activities is complete another quiet down call can be issued.

Could be implemented by setting a flag that does an action and then resets
itself?  Or the flag could be reset if a syscall that requires timers etc
is used?

Features that I think may be needed:

F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
			activites reset flag

F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
			require such actions in the future.

F_ISOL_BAREMETAL_WARN	-> Similar. Create a warning in the syslog when OS
				services require delayed processing etc
				but continue while resetting the flag.

Marcelo Tosatti Jan. 14, 2021, 7:34 p.m. UTC | #2

On Thu, Jan 14, 2021 at 09:22:54AM +0000, Christoph Lameter wrote:
> On Wed, 13 Jan 2021, Marcelo Tosatti wrote:
> 
> > So as discussed, this is one possible prctl interface for
> > task isolation.
> >
> > Is this something that is desired? If not, what is the
> > proper way for the interface to be?
> 
> Sure that sounds liek a good beginning but I guess we need some
> specificity on the features
> 
> > +Task isolation CPU interface
> > +============================
> 
> How does one do a oneshot flush of OS activities?

        ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
        if (ret == -1) {
                perror("prctl PR_TASK_ISOLATION_REQUEST");
                exit(0);
        }

> 
> I.e. I have a polling loop over numerous shared and I/o devices in user
> space and I want to make sure that the system is quite before I enter the
> loop. 

You could configure things in two ways: with syscalls allowed or not. 

Syscalls disallowed:
===================

1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
breaking):

        if ((ifeat & ISOL_F_BLOCK_SYSCALLS) == ISOL_F_BLOCK_SYSCALLS) {
		struct task_isolation_block_syscalls tibs = { list of
							 syscalls to block,
							 additional
							 parameters }

		struct task_isolation_notif tis = { parameters to control
						signal handling upon
						isolation breaking event }

                ret = prctl(PR_TASK_ISOLATION_SET, ISOL_F_SETUP_NOTIF, &tis);
		if (ret != 0) { ... }
		featuremask |= ISOL_F_SETUP_NOTIF;

                ret = prctl(PR_TASK_ISOLATION_SET, ISOL_F_BLOCK_SYSCALLS, &tibs);
		if (ret != 0) { ... }
		featuremask |= ISOL_F_BLOCK_SIGNALS;

                featuremask |= ISOL_F_QUIESCE;
        }

This would require knowledge of the behaviour of individual system
calls, that is whether or not these syscalls cause the CPU to be a target
of interruptions (1) (while the QUIESCE / HARD / WARN division you propose 
allows for coarse-grained control).

Perhaps coarse control while also allowing finer grained control 
(if desired) is a useful choice?

1: for example adding free pages to per-cpu free lists.

Syscalls allowed:
=================

> In the loop itself some activities may require syscalls so they will
> potentialy cause the OS services such as timers to start again.

Or a different mode where the syscall return itself can finish
any pending activities.

> When such
> an activities is complete another quiet down call can be issued.

Although this seems more efficient (if multiple syscalls are to be
used).

> Could be implemented by setting a flag that does an action and then resets
> itself?  Or the flag could be reset if a syscall that requires timers etc
> is used?

You mean to let userspace know if a certain syscall triggered a pending
action which must be finished (before "quiet mode" is entered again) ?
Sounds like a good idea.

> Features that I think may be needed:
> 
> F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
> 			activites reset flag
> 
> F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
> 			require such actions in the future.

Question: why BAREMETAL ?

Two comments:

1) HARD mode could also block activities from different CPUs that can 
interrupt this isolated CPU (for example CPU hotplug, or increasing 
per-CPU trace buffer size).

Unclear whether such blockage should be performed on:

-> Individual action basis (eg: BLOCK_CPU_HOTPLUG,
BLOCK_PERCPU_TRACEBUFFER_SIZE, ...) (which could allow
individual unblocking through a sysfs interface, for example).

Or

-> Be tied to a flag with a less implementation specific meaning such as
F_ISOL_BAREMETAL_HARD.

2) For a type of application it is the case that certain interruptions
can be tolerated, as long as they do not cross certain thresholds.
For example, one loses the flexibility to read/write MSRs 
on the isolated CPUs (including performance counters,
RDT/MBM type MSRs, frequency/power statistics) by 
forcing a "no interruptions" mode.

That flexibility seems to be useful (so perhaps 
F_ISOL_BAREMETAL_HARD but optionally permitting 
certain interruptions).

> F_ISOL_BAREMETAL_WARN	-> Similar. Create a warning in the syslog when OS
> 				services require delayed processing etc
> 				but continue while resetting the flag.

Alex seems to be interested in different notification methods as well.

Thanks for the input.

Christoph Lameter (Ampere) Jan. 15, 2021, 1:24 p.m. UTC | #3

On Thu, 14 Jan 2021, Marcelo Tosatti wrote:

> > How does one do a oneshot flush of OS activities?
>
>         ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
>         if (ret == -1) {
>                 perror("prctl PR_TASK_ISOLATION_REQUEST");
>                 exit(0);
>         }
>
> >
> > I.e. I have a polling loop over numerous shared and I/o devices in user
> > space and I want to make sure that the system is quite before I enter the
> > loop.
>
> You could configure things in two ways: with syscalls allowed or not.

Well syscalls that do not cause deferred processing like getting the time
or determining the current cpu should be ok to use.

And I already said that I want the system to quiet down and allow system
calls. Some indication that deferred actions have occurred may be useful
by f.e. resetting the flag.

> 1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
> syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
> breaking):

Well come up with a use case for that .... I know mine. What you propose
could be  useful for debugging for me but I would prefer the quiet down
approach where I determine when I use some syscalls or not and will deal
with the consequences.

>
> > Features that I think may be needed:
> >
> > F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
> > 			activites reset flag
> >
> > F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
> > 			require such actions in the future.
>
> Question: why BAREMETAL ?

To disinguish it from "Realtime". We want the processor for ourselves
without anything else running on it.

> Two comments:
>
> 1) HARD mode could also block activities from different CPUs that can
> interrupt this isolated CPU (for example CPU hotplug, or increasing
> per-CPU trace buffer size).

Blocking? The app should fail if any deferred actions are triggered as a
result of syscalls. It would give a warning with _WARN

> 2) For a type of application it is the case that certain interruptions
> can be tolerated, as long as they do not cross certain thresholds.
> For example, one loses the flexibility to read/write MSRs
> on the isolated CPUs (including performance counters,
> RDT/MBM type MSRs, frequency/power statistics) by
> forcing a "no interruptions" mode.

Does reading these really cause deferred actions by the OS? AFAICT you
could map these into memory as well as read them without OS activities.

"Interruptions that can be tolerated".... Well that is the wild west of
"realtime" where you can define how much of a time slice is "real" and how
much can be use by other processes. I do not think that any of that should
come into this API.

Alex Belits Jan. 15, 2021, 6:35 p.m. UTC | #4

On 1/15/21 05:24, Christoph Lameter wrote:

> ----------------------------------------------------------------------
> On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> 
>>> How does one do a oneshot flush of OS activities?
>>
>>          ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
>>          if (ret == -1) {
>>                  perror("prctl PR_TASK_ISOLATION_REQUEST");
>>                  exit(0);
>>          }
>>
>>>
>>> I.e. I have a polling loop over numerous shared and I/o devices in user
>>> space and I want to make sure that the system is quite before I enter the
>>> loop.
>>
>> You could configure things in two ways: with syscalls allowed or not.
> 
> Well syscalls that do not cause deferred processing like getting the time
> or determining the current cpu should be ok to use.

Some of those syscalls go through vdso, and don't enter the kernel -- 
nothing specific is necessary to allow them, and it would be pointless 
and difficult to prevent them.

For syscalls that enter the kernel, it's often difficult to predict, if 
they will or won't cause deferred processing, so I am afraid, it won't 
be possible to provide a "safe" class of syscalls for this purpose and 
not end up with something minimal like reading /sys and /proc. Right now 
isolation only "allows" syscalls that exit isolation.

It may be possible to set up a filter by the system (allowing few safe 
things like reading /proc) and let the user expand it by adding 
combinations of syscall / file descriptor. If some device is known to 
process operations safely, user can open it and mark file descriptor as 
allowed, say, for reading.

> And I already said that I want the system to quiet down and allow system
> calls. Some indication that deferred actions have occurred may be useful
> by f.e. resetting the flag.

I think, it should be possible to process a syscall, and if any deferred 
action occurred, exit isolation on return to userspace. Then there is a 
question, how userspace should be notified about isolation being lost. 
Normally this happens with a signal, but that is useful if we want 
syscall to fail with EINTR, not to succeed. Make sure that signal 
arrives after successful syscall return but before deferred action to 
happen? Sounds convoluted. Maybe reflecting isolation status in vdso and 
having the user check it there will be a good solution.

When I worked on my implementation I have encountered both a problem of 
interaction with the rest of system from isolated task (at least simple 
things as logging) and a problem of handling enter/exit from isolation 
on a system when it's possible for a task to be interrupted early after 
entering isolation due to various events that were still in progress on 
other CPUs.

I ended up implementing a manager/helper task that talks to tasks over a 
socket (when they are not isolated) and over ring buffers in shared 
memory (when they are isolated). While the current implementation is 
rather limited, the intention is to delegate to it everything that 
isolated task either can't do at all (like, writing logs) or that it 
would be cumbersome to implement (like monitoring the state of task, 
determining presence of deferred work after the task returned to 
userspace), etc.

It would be great if the complexity and amount of functionality of that 
manager/helper task can be reduced, however I believe that having such a 
task is a legitimate way of implementing things that otherwise would 
require additional functionality in kernel.

> 
>> 1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
>> syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
>> breaking):
> 
> Well come up with a use case for that .... I know mine. What you propose
> could be  useful for debugging for me but I would prefer the quiet down
> approach where I determine when I use some syscalls or not and will deal
> with the consequences.

For my purposes breaking isolation on syscalls and notifications about 
isolation breaking is a very useful approach -- this is why I kept it 
exactly as it was in the original implementation by Chris Metcalf.

In applications that I intend to use isolation for, the primary concern 
is consistent time for running code in userspace, so syscalls should be 
only issued when the task is specifically not in isolated mode. If the 
program issues a syscall by mistake (and that may happen when some 
libraries are used, or thread synchronization primitives are kept from 
non-isolated version of the program, even though isolated tasks are not 
supposed to use those), it means not only that deferred work may cause 
delay in the future, but also that there is an additional time to be 
spent in kernel. This should be immediately visible to the developer, 
and the best way to do it is by breaking isolation on syscall immediately.

> 
>>
>>> Features that I think may be needed:
>>>
>>> F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
>>> 			activites reset flag
>>>
>>> F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
>>> 			require such actions in the future.
>>
>> Question: why BAREMETAL ?
> 
> To disinguish it from "Realtime". We want the processor for ourselves
> without anything else running on it.
> 
>> Two comments:
>>
>> 1) HARD mode could also block activities from different CPUs that can
>> interrupt this isolated CPU (for example CPU hotplug, or increasing
>> per-CPU trace buffer size).
> 
> Blocking? The app should fail if any deferred actions are triggered as a
> result of syscalls. It would give a warning with _WARN

There are many supposedly innocent things, nowhere at the scale of CPU 
hotplug, that happen in a system and result in synchronization 
implemented as an IPI to every online CPU. We should consider them to be 
an ordinary occurrence, so there is a choice:

1. Ignore them completely and allow them in isolated mode. This will 
delay userspace with no indication and no isolation breaking.

2. Allow them, and notify userspace afterwards (through vdso or through 
userspace helper/manager over shared memory). This may be useful in 
those rare situations when the consequences of delay can be mitigated 
afterwards.

3. Make them break isolation, with userspace being notified normally 
(ex: with a signal in the current implementation). I guess, can be used 
if somehow most of the causes will be eliminated.

4. Prevent them from reaching the target CPU and make sure that whatever 
synchronization they are intended to cause, will happen when intended 
target CPU will enter to kernel later. Since we may have to synchronize 
things like code modification, some of this synchronization has to 
happen very early on kernel entry.

I am most interested in (4), so this is what was implemented in my 
version of the patch (and currently I am trying to achieve completeness 
and, if possible, elegance of the implementation).

I guess, if we want to add more controls, we can allow the user to 
choose either of those four options, or of a subset of them. In my 
opinion, if (4) will be available, and the only additional cost will be 
time for synchronization spent in breaking isolation procedure, there is 
not much need in the other three. Without (4) I don't think, the goal of 
providing consistent, interruption-free environment is achieved at all, 
so not implementing it would be very bad.

>> 2) For a type of application it is the case that certain interruptions
>> can be tolerated, as long as they do not cross certain thresholds.
>> For example, one loses the flexibility to read/write MSRs
>> on the isolated CPUs (including performance counters,
>> RDT/MBM type MSRs, frequency/power statistics) by
>> forcing a "no interruptions" mode.
> 
> Does reading these really cause deferred actions by the OS? AFAICT you
> could map these into memory as well as read them without OS activities.

Access to those is hardware/architecture-specific, and in many cases, 
indeed, there is no need to issue a syscall at all.

However for many applications the model with a helper task performing 
interactions with OS on a different core and exchanging data over shared 
memory may be sufficient, and it will also provide clear separation 
between operations that do require consistent timing and those that don't.

> 
> "Interruptions that can be tolerated".... Well that is the wild west of
> "realtime" where you can define how much of a time slice is "real" and how
> much can be use by other processes. I do not think that any of that should
> come into this API.
> 

To be honest, I have no idea, what can and can not be tolerated by 
applications other than what I am familiar with. Applications that I 
know, require no interruptions at all, so I want to implement that. I 
assume, someone already uses existing CPU isolation for the purpose of 
providing "nearly interrupt-less" environment.

I can imaging something like a task of controlling a large slow-updating 
LED display by bit-banging a strictly timed long serial message 
representing a frame or frame update. If interrupted, it may, depending 
on the protocol, corrupt the state of a single LED or fail to update 
until the end of the screen, but the next start of message will reset 
the state, and everything will work until the next interrupt. Maybe 
there are more realistic or useful examples.

Marcelo Tosatti Jan. 18, 2021, 3:18 p.m. UTC | #5

On Fri, Jan 15, 2021 at 01:24:10PM +0000, Christoph Lameter wrote:
> On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> 
> > > How does one do a oneshot flush of OS activities?
> >
> >         ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
> >         if (ret == -1) {
> >                 perror("prctl PR_TASK_ISOLATION_REQUEST");
> >                 exit(0);
> >         }
> >
> > >
> > > I.e. I have a polling loop over numerous shared and I/o devices in user
> > > space and I want to make sure that the system is quite before I enter the
> > > loop.
> >
> > You could configure things in two ways: with syscalls allowed or not.
> 
> Well syscalls that do not cause deferred processing like getting the time
> or determining the current cpu should be ok to use.

Yes.

> And I already said that I want the system to quiet down and allow system
> calls. 

Also see that as being useful.

> Some indication that deferred actions have occurred may be useful
> by f.e. resetting the flag.

OK: will implement on next patchset.

> > 1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
> > syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
> > breaking):
> 
> Well come up with a use case for that .... I know mine. 

Trying to come up with an interface that accomodates all known use
cases. Maybe passing the allowed list of syscalls is overkill,
but Alex seems interested in the notification to break isolation.

> What you propose
> could be  useful for debugging for me but I would prefer the quiet down
> approach where I determine when I use some syscalls or not and will deal
> with the consequences.

Trying to cover the usecases that Alex mentioned on this thread...

> > > Features that I think may be needed:
> > >
> > > F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
> > > 			activites reset flag
> > >
> > > F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
> > > 			require such actions in the future.
> >
> > Question: why BAREMETAL ?
> 
> To disinguish it from "Realtime". We want the processor for ourselves
> without anything else running on it.

OK.

> > Two comments:
> >
> > 1) HARD mode could also block activities from different CPUs that can
> > interrupt this isolated CPU (for example CPU hotplug, or increasing
> > per-CPU trace buffer size).
> 
> Blocking? 

Block CPU hotplug for example: 

# echo 0 > /sys/devices/system/cpu/cpu3/online
returns -EBUSY with message saying:

"Can't offline cpu3: reason: cpu9 is isolated by application APP".

> The app should fail if any deferred actions are triggered as a
> result of syscalls. It would give a warning with _WARN
> 
> > 2) For a type of application it is the case that certain interruptions
> > can be tolerated, as long as they do not cross certain thresholds.
> > For example, one loses the flexibility to read/write MSRs
> > on the isolated CPUs (including performance counters,
> > RDT/MBM type MSRs, frequency/power statistics) by
> > forcing a "no interruptions" mode.
> 
> Does reading these really cause deferred actions by the OS? AFAICT you
> could map these into memory as well as read them without OS activities.

AFAIK you can't for MSRs.

> "Interruptions that can be tolerated".... Well that is the wild west of
> "realtime" where you can define how much of a time slice is "real" and how
> much can be use by other processes. I do not think that any of that should
> come into this API.

Understood.

Marcelo Tosatti Jan. 21, 2021, 3:51 p.m. UTC | #6

Hi Alex,

On Fri, Jan 15, 2021 at 10:35:14AM -0800, Alex Belits wrote:
> On 1/15/21 05:24, Christoph Lameter wrote:
> 
> > ----------------------------------------------------------------------
> > On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> > 
> > > > How does one do a oneshot flush of OS activities?
> > > 
> > >          ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
> > >          if (ret == -1) {
> > >                  perror("prctl PR_TASK_ISOLATION_REQUEST");
> > >                  exit(0);
> > >          }
> > > 
> > > > 
> > > > I.e. I have a polling loop over numerous shared and I/o devices in user
> > > > space and I want to make sure that the system is quite before I enter the
> > > > loop.
> > > 
> > > You could configure things in two ways: with syscalls allowed or not.
> > 
> > Well syscalls that do not cause deferred processing like getting the time
> > or determining the current cpu should be ok to use.
> 
> Some of those syscalls go through vdso, and don't enter the kernel --
> nothing specific is necessary to allow them, and it would be pointless and
> difficult to prevent them.
> 
> For syscalls that enter the kernel, it's often difficult to predict, if they
> will or won't cause deferred processing, so I am afraid, it won't be
> possible to provide a "safe" class of syscalls for this purpose and not end
> up with something minimal like reading /sys and /proc. Right now isolation
> only "allows" syscalls that exit isolation.

Christoph wrote:

"> Features that I think may be needed:
> 
> F_ISOL_QUIESCE                -> quiet down now but allow all OS activities. OS
>                       activites reset flag
> 
> F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
>                       require such actions in the future.
> 
> F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
>                               services require delayed processing etc
>                               but continue while resetting the flag.
"

It seems the only difference between HARD and WARN (lets call it SOFT) 
would be whether a notification is sent to userspace.

The definition 

"F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
                       require such actions in the future."

fails in the static_key_enable case: Alex's idea is to queue the i-cache
flush if the remote task/cpu is in isolated mode (and perform the flush 
when entering the kernel).

So even if userspace uses syscalls that do not require delayed
processing, there are events which are out of control of the
application and might require it.

So lets assume the application performs a number of syscalls on a
given time critical codepath. 

Either the system is configured so that 
the number/frequency of static_key_enable's is limited, or the cost of
i-cache flushes must be accounted on that critical codepath.

Anyway, trying to improve Christoph's definition:

F_ISOL_QUIESCE                -> flush any pending operations that might cause
				 the CPU to be interrupted (ex: free's
				 per-CPU queues, sync MM statistics
				 counters, etc).

F_ISOL_ISOLATE		      -> inform the kernel that userspace is
				 entering isolated mode (see description
				 below on "ISOLATION MODES").

F_ISOL_UNISOLATE              -> inform the kernel that userspace is
				 leaving isolated mode.

F_ISOL_NOTIFY		      -> notification mode of isolation breakage
				 modes.


Isolation modes:
---------------

There are two main types of isolation modes: 

- SOFT mode: does not prevent activities which might generate interruptions
(such as CPU hotplug).

- HARD mode: prevents all blockable activities that might generate interruptions.
Administrators can override this via /sys.

Notifications:
-------------

Notification mode of isolation breakage can be configured as follows:

- None (default): No notification is performed by the kernel on isolation
  breakage.

- Syslog: Isolation breakage is reported to syslog. 

(new modes can be added, for example signals).

A new feature can be added to disallow syscalls (by default syscalls
are enabled, with reporting of pending activities that might cause
an interruption in a VDSO).

How about that?

> F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
>                       require such actions in the future.
> 
> F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
>                               services require delayed processing etc
>                               but continue while resetting the flag.



> It may be possible to set up a filter by the system (allowing few safe
> things like reading /proc) and let the user expand it by adding combinations
> of syscall / file descriptor. If some device is known to process operations
> safely, user can open it and mark file descriptor as allowed, say, for
> reading.

Makes sense.

> > And I already said that I want the system to quiet down and allow system
> > calls. Some indication that deferred actions have occurred may be useful
> > by f.e. resetting the flag.

Do you think reporting activities that add overhead (the i-cache flush
in mind) to syscalls separately in the VDSO?

> I think, it should be possible to process a syscall, and if any deferred
> action occurred, exit isolation on return to userspace. 

On the interface we are creating:

	ret = syscall()...
	if (vdso.pending_activity) {
		prctl(PR_TASK_ISOLATION_REQUEST, F_ISOL_UNISOLATE, 0, 0);
		...
	}

Why would it be necessary to exit isolation on return to userspace
again?

> Then there is a
> question, how userspace should be notified about isolation being lost.
> Normally this happens with a signal, but that is useful if we want syscall
> to fail with EINTR, not to succeed. Make sure that signal arrives after
> successful syscall return but before deferred action to happen? Sounds
> convoluted. Maybe reflecting isolation status in vdso and having the user
> check it there will be a good solution.

Why can't userspace enable/disable isolation mode (and the kernel only
reports it) ?

I fail to see why the order of the events "isolated mode disablement"
and "return to userspace" is critical.

> When I worked on my implementation I have encountered both a problem of
> interaction with the rest of system from isolated task (at least simple
> things as logging) and a problem of handling enter/exit from isolation on a
> system when it's possible for a task to be interrupted early after entering
> isolation due to various events that were still in progress on other CPUs.
> 
> I ended up implementing a manager/helper task that talks to tasks over a
> socket (when they are not isolated) and over ring buffers in shared memory
> (when they are isolated). While the current implementation is rather
> limited, the intention is to delegate to it everything that isolated task
> either can't do at all (like, writing logs) or that it would be cumbersome
> to implement (like monitoring the state of task, determining presence of
> deferred work after the task returned to userspace), etc.

Interesting. Are you considering opensourcing such library? Seems like a
generic problem.

> It would be great if the complexity and amount of functionality of that
> manager/helper task can be reduced, however I believe that having such a
> task is a legitimate way of implementing things that otherwise would require
> additional functionality in kernel.
> 
> > 
> > > 1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
> > > syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
> > > breaking):
> > 
> > Well come up with a use case for that .... I know mine. What you propose
> > could be  useful for debugging for me but I would prefer the quiet down
> > approach where I determine when I use some syscalls or not and will deal
> > with the consequences.
> 
> For my purposes breaking isolation on syscalls and notifications about
> isolation breaking is a very useful approach -- this is why I kept it
> exactly as it was in the original implementation by Chris Metcalf.
> 
> In applications that I intend to use isolation for, the primary concern is
> consistent time for running code in userspace, so syscalls should be only
> issued when the task is specifically not in isolated mode. If the program
> issues a syscall by mistake (and that may happen when some libraries are
> used, or thread synchronization primitives are kept from non-isolated
> version of the program, even though isolated tasks are not supposed to use
> those), it means not only that deferred work may cause delay in the future,
> but also that there is an additional time to be spent in kernel. This should
> be immediately visible to the developer, and the best way to do it is by
> breaking isolation on syscall immediately.

I guess you can do that by hooking a BPF program to cpu->is_isolated ==
true (for development) and syscall entry.

> > > 
> > > > Features that I think may be needed:
> > > > 
> > > > F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
> > > > 			activites reset flag
> > > > 
> > > > F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
> > > > 			require such actions in the future.
> > > 
> > > Question: why BAREMETAL ?
> > 
> > To disinguish it from "Realtime". We want the processor for ourselves
> > without anything else running on it.
> > 
> > > Two comments:
> > > 
> > > 1) HARD mode could also block activities from different CPUs that can
> > > interrupt this isolated CPU (for example CPU hotplug, or increasing
> > > per-CPU trace buffer size).
> > 
> > Blocking? The app should fail if any deferred actions are triggered as a
> > result of syscalls. It would give a warning with _WARN
> 
> There are many supposedly innocent things, nowhere at the scale of CPU
> hotplug, that happen in a system and result in synchronization implemented
> as an IPI to every online CPU. We should consider them to be an ordinary
> occurrence, so there is a choice:
> 
> 1. Ignore them completely and allow them in isolated mode. This will delay
> userspace with no indication and no isolation breaking.
> 
> 2. Allow them, and notify userspace afterwards (through vdso or through
> userspace helper/manager over shared memory). This may be useful in those
> rare situations when the consequences of delay can be mitigated afterwards.
> 
> 3. Make them break isolation, with userspace being notified normally (ex:
> with a signal in the current implementation). I guess, can be used if
> somehow most of the causes will be eliminated.
> 
> 4. Prevent them from reaching the target CPU and make sure that whatever
> synchronization they are intended to cause, will happen when intended target
> CPU will enter to kernel later. Since we may have to synchronize things like
> code modification, some of this synchronization has to happen very early on
> kernel entry.
> 
> I am most interested in (4), so this is what was implemented in my version
> of the patch (and currently I am trying to achieve completeness and, if
> possible, elegance of the implementation).

Agree. (3) will be necessary as intermediate step. The proposed
improvement to Christoph's reply, in this thread, separates notification 
and syscall blockage. 

> I guess, if we want to add more controls, we can allow the user to choose
> either of those four options, or of a subset of them. In my opinion, if (4)
> will be available, and the only additional cost will be time for
> synchronization spent in breaking isolation procedure, there is not much
> need in the other three. Without (4) I don't think, the goal of providing
> consistent, interruption-free environment is achieved at all, so not
> implementing it would be very bad.

Agree.

> > > 2) For a type of application it is the case that certain interruptions
> > > can be tolerated, as long as they do not cross certain thresholds.
> > > For example, one loses the flexibility to read/write MSRs
> > > on the isolated CPUs (including performance counters,
> > > RDT/MBM type MSRs, frequency/power statistics) by
> > > forcing a "no interruptions" mode.
> > 
> > Does reading these really cause deferred actions by the OS? AFAICT you
> > could map these into memory as well as read them without OS activities.
> 
> Access to those is hardware/architecture-specific, and in many cases,
> indeed, there is no need to issue a syscall at all.
> 
> However for many applications the model with a helper task performing
> interactions with OS on a different core and exchanging data over shared
> memory may be sufficient, and it will also provide clear separation between
> operations that do require consistent timing and those that don't.

I see.

> > "Interruptions that can be tolerated".... Well that is the wild west of
> > "realtime" where you can define how much of a time slice is "real" and how
> > much can be use by other processes. I do not think that any of that should
> > come into this API.
> > 
> 
> To be honest, I have no idea, what can and can not be tolerated by
> applications other than what I am familiar with. Applications that I know,
> require no interruptions at all, so I want to implement that. I assume,
> someone already uses existing CPU isolation for the purpose of providing
> "nearly interrupt-less" environment.
> 
> I can imaging something like a task of controlling a large slow-updating LED
> display by bit-banging a strictly timed long serial message representing a
> frame or frame update. If interrupted, it may, depending on the protocol,
> corrupt the state of a single LED or fail to update until the end of the
> screen, but the next start of message will reset the state, and everything
> will work until the next interrupt. Maybe there are more realistic or useful
> examples.

Agree that "no interruptions" as a goal makes most sense. 

Can "whitelist" certain interruptions if necessary (to handle the MSR
read case), if user desires.

Marcelo Tosatti Jan. 21, 2021, 4:20 p.m. UTC | #7

Adding Nitesh to CC.

On Thu, Jan 21, 2021 at 12:51:41PM -0300, Marcelo Tosatti wrote:
> Hi Alex,
> 
> On Fri, Jan 15, 2021 at 10:35:14AM -0800, Alex Belits wrote:
> > On 1/15/21 05:24, Christoph Lameter wrote:
> > 
> > > ----------------------------------------------------------------------
> > > On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> > > 
> > > > > How does one do a oneshot flush of OS activities?
> > > > 
> > > >          ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
> > > >          if (ret == -1) {
> > > >                  perror("prctl PR_TASK_ISOLATION_REQUEST");
> > > >                  exit(0);
> > > >          }
> > > > 
> > > > > 
> > > > > I.e. I have a polling loop over numerous shared and I/o devices in user
> > > > > space and I want to make sure that the system is quite before I enter the
> > > > > loop.
> > > > 
> > > > You could configure things in two ways: with syscalls allowed or not.
> > > 
> > > Well syscalls that do not cause deferred processing like getting the time
> > > or determining the current cpu should be ok to use.
> > 
> > Some of those syscalls go through vdso, and don't enter the kernel --
> > nothing specific is necessary to allow them, and it would be pointless and
> > difficult to prevent them.
> > 
> > For syscalls that enter the kernel, it's often difficult to predict, if they
> > will or won't cause deferred processing, so I am afraid, it won't be
> > possible to provide a "safe" class of syscalls for this purpose and not end
> > up with something minimal like reading /sys and /proc. Right now isolation
> > only "allows" syscalls that exit isolation.
> 
> Christoph wrote:
> 
> "> Features that I think may be needed:
> > 
> > F_ISOL_QUIESCE                -> quiet down now but allow all OS activities. OS
> >                       activites reset flag
> > 
> > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> >                       require such actions in the future.
> > 
> > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> >                               services require delayed processing etc
> >                               but continue while resetting the flag.
> "
> 
> It seems the only difference between HARD and WARN (lets call it SOFT) 
> would be whether a notification is sent to userspace.
> 
> The definition 
> 
> "F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
>                        require such actions in the future."
> 
> fails in the static_key_enable case: Alex's idea is to queue the i-cache
> flush if the remote task/cpu is in isolated mode (and perform the flush 
> when entering the kernel).
> 
> So even if userspace uses syscalls that do not require delayed
> processing, there are events which are out of control of the
> application and might require it.
> 
> So lets assume the application performs a number of syscalls on a
> given time critical codepath. 
> 
> Either the system is configured so that 
> the number/frequency of static_key_enable's is limited, or the cost of
> i-cache flushes must be accounted on that critical codepath.
> 
> Anyway, trying to improve Christoph's definition:
> 
> F_ISOL_QUIESCE                -> flush any pending operations that might cause
> 				 the CPU to be interrupted (ex: free's
> 				 per-CPU queues, sync MM statistics
> 				 counters, etc).
> 
> F_ISOL_ISOLATE		      -> inform the kernel that userspace is
> 				 entering isolated mode (see description
> 				 below on "ISOLATION MODES").
> 
> F_ISOL_UNISOLATE              -> inform the kernel that userspace is
> 				 leaving isolated mode.
> 
> F_ISOL_NOTIFY		      -> notification mode of isolation breakage
> 				 modes.
> 
> 
> Isolation modes:
> ---------------
> 
> There are two main types of isolation modes: 
> 
> - SOFT mode: does not prevent activities which might generate interruptions
> (such as CPU hotplug).
> 
> - HARD mode: prevents all blockable activities that might generate interruptions.
> Administrators can override this via /sys.
> 
> Notifications:
> -------------
> 
> Notification mode of isolation breakage can be configured as follows:
> 
> - None (default): No notification is performed by the kernel on isolation
>   breakage.
> 
> - Syslog: Isolation breakage is reported to syslog. 
> 
> (new modes can be added, for example signals).
> 
> A new feature can be added to disallow syscalls (by default syscalls
> are enabled, with reporting of pending activities that might cause
> an interruption in a VDSO).
> 
> How about that?
> 
> > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> >                       require such actions in the future.
> > 
> > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> >                               services require delayed processing etc
> >                               but continue while resetting the flag.
> 
> 
> 
> > It may be possible to set up a filter by the system (allowing few safe
> > things like reading /proc) and let the user expand it by adding combinations
> > of syscall / file descriptor. If some device is known to process operations
> > safely, user can open it and mark file descriptor as allowed, say, for
> > reading.
> 
> Makes sense.
> 
> > > And I already said that I want the system to quiet down and allow system
> > > calls. Some indication that deferred actions have occurred may be useful
> > > by f.e. resetting the flag.
> 
> Do you think reporting activities that add overhead (the i-cache flush
> in mind) to syscalls separately in the VDSO?
> 
> > I think, it should be possible to process a syscall, and if any deferred
> > action occurred, exit isolation on return to userspace. 
> 
> On the interface we are creating:
> 
> 	ret = syscall()...
> 	if (vdso.pending_activity) {
> 		prctl(PR_TASK_ISOLATION_REQUEST, F_ISOL_UNISOLATE, 0, 0);
> 		...
> 	}
> 
> Why would it be necessary to exit isolation on return to userspace
> again?
> 
> > Then there is a
> > question, how userspace should be notified about isolation being lost.
> > Normally this happens with a signal, but that is useful if we want syscall
> > to fail with EINTR, not to succeed. Make sure that signal arrives after
> > successful syscall return but before deferred action to happen? Sounds
> > convoluted. Maybe reflecting isolation status in vdso and having the user
> > check it there will be a good solution.
> 
> Why can't userspace enable/disable isolation mode (and the kernel only
> reports it) ?
> 
> I fail to see why the order of the events "isolated mode disablement"
> and "return to userspace" is critical.
> 
> > When I worked on my implementation I have encountered both a problem of
> > interaction with the rest of system from isolated task (at least simple
> > things as logging) and a problem of handling enter/exit from isolation on a
> > system when it's possible for a task to be interrupted early after entering
> > isolation due to various events that were still in progress on other CPUs.
> > 
> > I ended up implementing a manager/helper task that talks to tasks over a
> > socket (when they are not isolated) and over ring buffers in shared memory
> > (when they are isolated). While the current implementation is rather
> > limited, the intention is to delegate to it everything that isolated task
> > either can't do at all (like, writing logs) or that it would be cumbersome
> > to implement (like monitoring the state of task, determining presence of
> > deferred work after the task returned to userspace), etc.
> 
> Interesting. Are you considering opensourcing such library? Seems like a
> generic problem.
> 
> > It would be great if the complexity and amount of functionality of that
> > manager/helper task can be reduced, however I believe that having such a
> > task is a legitimate way of implementing things that otherwise would require
> > additional functionality in kernel.
> > 
> > > 
> > > > 1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
> > > > syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
> > > > breaking):
> > > 
> > > Well come up with a use case for that .... I know mine. What you propose
> > > could be  useful for debugging for me but I would prefer the quiet down
> > > approach where I determine when I use some syscalls or not and will deal
> > > with the consequences.
> > 
> > For my purposes breaking isolation on syscalls and notifications about
> > isolation breaking is a very useful approach -- this is why I kept it
> > exactly as it was in the original implementation by Chris Metcalf.
> > 
> > In applications that I intend to use isolation for, the primary concern is
> > consistent time for running code in userspace, so syscalls should be only
> > issued when the task is specifically not in isolated mode. If the program
> > issues a syscall by mistake (and that may happen when some libraries are
> > used, or thread synchronization primitives are kept from non-isolated
> > version of the program, even though isolated tasks are not supposed to use
> > those), it means not only that deferred work may cause delay in the future,
> > but also that there is an additional time to be spent in kernel. This should
> > be immediately visible to the developer, and the best way to do it is by
> > breaking isolation on syscall immediately.
> 
> I guess you can do that by hooking a BPF program to cpu->is_isolated ==
> true (for development) and syscall entry.
> 
> > > > 
> > > > > Features that I think may be needed:
> > > > > 
> > > > > F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
> > > > > 			activites reset flag
> > > > > 
> > > > > F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
> > > > > 			require such actions in the future.
> > > > 
> > > > Question: why BAREMETAL ?
> > > 
> > > To disinguish it from "Realtime". We want the processor for ourselves
> > > without anything else running on it.
> > > 
> > > > Two comments:
> > > > 
> > > > 1) HARD mode could also block activities from different CPUs that can
> > > > interrupt this isolated CPU (for example CPU hotplug, or increasing
> > > > per-CPU trace buffer size).
> > > 
> > > Blocking? The app should fail if any deferred actions are triggered as a
> > > result of syscalls. It would give a warning with _WARN
> > 
> > There are many supposedly innocent things, nowhere at the scale of CPU
> > hotplug, that happen in a system and result in synchronization implemented
> > as an IPI to every online CPU. We should consider them to be an ordinary
> > occurrence, so there is a choice:
> > 
> > 1. Ignore them completely and allow them in isolated mode. This will delay
> > userspace with no indication and no isolation breaking.
> > 
> > 2. Allow them, and notify userspace afterwards (through vdso or through
> > userspace helper/manager over shared memory). This may be useful in those
> > rare situations when the consequences of delay can be mitigated afterwards.
> > 
> > 3. Make them break isolation, with userspace being notified normally (ex:
> > with a signal in the current implementation). I guess, can be used if
> > somehow most of the causes will be eliminated.
> > 
> > 4. Prevent them from reaching the target CPU and make sure that whatever
> > synchronization they are intended to cause, will happen when intended target
> > CPU will enter to kernel later. Since we may have to synchronize things like
> > code modification, some of this synchronization has to happen very early on
> > kernel entry.
> > 
> > I am most interested in (4), so this is what was implemented in my version
> > of the patch (and currently I am trying to achieve completeness and, if
> > possible, elegance of the implementation).
> 
> Agree. (3) will be necessary as intermediate step. The proposed
> improvement to Christoph's reply, in this thread, separates notification 
> and syscall blockage. 
> 
> > I guess, if we want to add more controls, we can allow the user to choose
> > either of those four options, or of a subset of them. In my opinion, if (4)
> > will be available, and the only additional cost will be time for
> > synchronization spent in breaking isolation procedure, there is not much
> > need in the other three. Without (4) I don't think, the goal of providing
> > consistent, interruption-free environment is achieved at all, so not
> > implementing it would be very bad.
> 
> Agree.
> 
> > > > 2) For a type of application it is the case that certain interruptions
> > > > can be tolerated, as long as they do not cross certain thresholds.
> > > > For example, one loses the flexibility to read/write MSRs
> > > > on the isolated CPUs (including performance counters,
> > > > RDT/MBM type MSRs, frequency/power statistics) by
> > > > forcing a "no interruptions" mode.
> > > 
> > > Does reading these really cause deferred actions by the OS? AFAICT you
> > > could map these into memory as well as read them without OS activities.
> > 
> > Access to those is hardware/architecture-specific, and in many cases,
> > indeed, there is no need to issue a syscall at all.
> > 
> > However for many applications the model with a helper task performing
> > interactions with OS on a different core and exchanging data over shared
> > memory may be sufficient, and it will also provide clear separation between
> > operations that do require consistent timing and those that don't.
> 
> I see.
> 
> > > "Interruptions that can be tolerated".... Well that is the wild west of
> > > "realtime" where you can define how much of a time slice is "real" and how
> > > much can be use by other processes. I do not think that any of that should
> > > come into this API.
> > > 
> > 
> > To be honest, I have no idea, what can and can not be tolerated by
> > applications other than what I am familiar with. Applications that I know,
> > require no interruptions at all, so I want to implement that. I assume,
> > someone already uses existing CPU isolation for the purpose of providing
> > "nearly interrupt-less" environment.
> > 
> > I can imaging something like a task of controlling a large slow-updating LED
> > display by bit-banging a strictly timed long serial message representing a
> > frame or frame update. If interrupted, it may, depending on the protocol,
> > corrupt the state of a single LED or fail to update until the end of the
> > screen, but the next start of message will reset the state, and everything
> > will work until the next interrupt. Maybe there are more realistic or useful
> > examples.
> 
> Agree that "no interruptions" as a goal makes most sense. 
> 
> Can "whitelist" certain interruptions if necessary (to handle the MSR
> read case), if user desires.
>

Marcelo Tosatti Jan. 22, 2021, 1:05 p.m. UTC | #8

On Thu, Jan 21, 2021 at 01:20:59PM -0300, Marcelo Tosatti wrote:
> 
> Adding Nitesh to CC.
> 
> On Thu, Jan 21, 2021 at 12:51:41PM -0300, Marcelo Tosatti wrote:
> > Hi Alex,
> > 
> > On Fri, Jan 15, 2021 at 10:35:14AM -0800, Alex Belits wrote:
> > > On 1/15/21 05:24, Christoph Lameter wrote:
> > > 
> > > > ----------------------------------------------------------------------
> > > > On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> > > > 
> > > > > > How does one do a oneshot flush of OS activities?
> > > > > 
> > > > >          ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
> > > > >          if (ret == -1) {
> > > > >                  perror("prctl PR_TASK_ISOLATION_REQUEST");
> > > > >                  exit(0);
> > > > >          }
> > > > > 
> > > > > > 
> > > > > > I.e. I have a polling loop over numerous shared and I/o devices in user
> > > > > > space and I want to make sure that the system is quite before I enter the
> > > > > > loop.
> > > > > 
> > > > > You could configure things in two ways: with syscalls allowed or not.
> > > > 
> > > > Well syscalls that do not cause deferred processing like getting the time
> > > > or determining the current cpu should be ok to use.
> > > 
> > > Some of those syscalls go through vdso, and don't enter the kernel --
> > > nothing specific is necessary to allow them, and it would be pointless and
> > > difficult to prevent them.
> > > 
> > > For syscalls that enter the kernel, it's often difficult to predict, if they
> > > will or won't cause deferred processing, so I am afraid, it won't be
> > > possible to provide a "safe" class of syscalls for this purpose and not end
> > > up with something minimal like reading /sys and /proc. Right now isolation
> > > only "allows" syscalls that exit isolation.
> > 
> > Christoph wrote:
> > 
> > "> Features that I think may be needed:
> > > 
> > > F_ISOL_QUIESCE                -> quiet down now but allow all OS activities. OS
> > >                       activites reset flag
> > > 
> > > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> > >                       require such actions in the future.
> > > 
> > > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> > >                               services require delayed processing etc
> > >                               but continue while resetting the flag.
> > "
> > 
> > It seems the only difference between HARD and WARN (lets call it SOFT) 
> > would be whether a notification is sent to userspace.
> > 
> > The definition 
> > 
> > "F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> >                        require such actions in the future."
> > 
> > fails in the static_key_enable case: Alex's idea is to queue the i-cache
> > flush if the remote task/cpu is in isolated mode (and perform the flush 
> > when entering the kernel).
> > 
> > So even if userspace uses syscalls that do not require delayed
> > processing, there are events which are out of control of the
> > application and might require it.
> > 
> > So lets assume the application performs a number of syscalls on a
> > given time critical codepath. 
> > 
> > Either the system is configured so that 
> > the number/frequency of static_key_enable's is limited, or the cost of
> > i-cache flushes must be accounted on that critical codepath.
> > 
> > Anyway, trying to improve Christoph's definition:
> > 
> > F_ISOL_QUIESCE                -> flush any pending operations that might cause
> > 				 the CPU to be interrupted (ex: free's
> > 				 per-CPU queues, sync MM statistics
> > 				 counters, etc).
> > 
> > F_ISOL_ISOLATE		      -> inform the kernel that userspace is
> > 				 entering isolated mode (see description
> > 				 below on "ISOLATION MODES").
> > 
> > F_ISOL_UNISOLATE              -> inform the kernel that userspace is
> > 				 leaving isolated mode.
> > 
> > F_ISOL_NOTIFY		      -> notification mode of isolation breakage
> > 				 modes.
> > 
> > 
> > Isolation modes:
> > ---------------
> > 
> > There are two main types of isolation modes: 
> > 
> > - SOFT mode: does not prevent activities which might generate interruptions
> > (such as CPU hotplug).
> > 
> > - HARD mode: prevents all blockable activities that might generate interruptions.
> > Administrators can override this via /sys.
> > 
> > Notifications:
> > -------------
> > 
> > Notification mode of isolation breakage can be configured as follows:
> > 
> > - None (default): No notification is performed by the kernel on isolation
> >   breakage.
> > 
> > - Syslog: Isolation breakage is reported to syslog. 
> > 
> > (new modes can be added, for example signals).
> > 
> > A new feature can be added to disallow syscalls (by default syscalls
> > are enabled, with reporting of pending activities that might cause
> > an interruption in a VDSO).

After discussion with Juri and Daniel, it became clearer that supporting
unmodified applications would be quite useful:

	- enter isolation mode
	- run unmodified application
	- leave isolation mode

This could work via an additional mode which goes through the quiesce
operation at every syscall return. Since this includes freeing per-CPU
pagevecs (therefore allocating per-CPU pagevecs at the next syscall),
it might considerably slowdown system startup (and cause MM related 
spinlocks contention).

Better ideas are appreciated.

Christoph Lameter (Ampere) Feb. 1, 2021, 10:48 a.m. UTC | #9

On Thu, 21 Jan 2021, Marcelo Tosatti wrote:

> Anyway, trying to improve Christoph's definition:
>
> F_ISOL_QUIESCE                -> flush any pending operations that might cause
> 				 the CPU to be interrupted (ex: free's
> 				 per-CPU queues, sync MM statistics
> 				 counters, etc).
>
> F_ISOL_ISOLATE		      -> inform the kernel that userspace is
> 				 entering isolated mode (see description
> 				 below on "ISOLATION MODES").
>
> F_ISOL_UNISOLATE              -> inform the kernel that userspace is
> 				 leaving isolated mode.
>
> F_ISOL_NOTIFY		      -> notification mode of isolation breakage
> 				 modes.

Looks good to me.


> Isolation modes:
> ---------------
>
> There are two main types of isolation modes:
>
> - SOFT mode: does not prevent activities which might generate interruptions
> (such as CPU hotplug).
>
> - HARD mode: prevents all blockable activities that might generate interruptions.
> Administrators can override this via /sys.


Yup.

>
> Notifications:
> -------------
>
> Notification mode of isolation breakage can be configured as follows:
>
> - None (default): No notification is performed by the kernel on isolation
>   breakage.
>
> - Syslog: Isolation breakage is reported to syslog.


- Abort with core dump

This is useful for debugging and for hard core bare metalers that never
want any interrupts.

One particular issue are page faults.  One would have to prefault the
binary executable functions in order to avoid "interruptions" through page
faults. Are these proper interrutions of the code? Certainly major faults
are but minor faults may be ok? Dunno.

In practice what I have often seen in such apps is that there is a "warm"
up mode where all critical functions are executed, all important variables
are touched and dummy I/Os are performed in order to populate the caches
and prefault all the data.I guess one would run these without isolation
first and then switch on some sort of isolation mode after warm up. So far
I think most people relied on the timer interrupt etc etc to be turned off
after a few secs of just running throught a polling loop without any OS
activities.

> > I ended up implementing a manager/helper task that talks to tasks over a
> > socket (when they are not isolated) and over ring buffers in shared memory
> > (when they are isolated). While the current implementation is rather
> > limited, the intention is to delegate to it everything that isolated task
> > either can't do at all (like, writing logs) or that it would be cumbersome
> > to implement (like monitoring the state of task, determining presence of
> > deferred work after the task returned to userspace), etc.
>
> Interesting. Are you considering opensourcing such library? Seems like a
> generic problem.

Well everyone swears on having the right implementation. The people I know
would not do any thing with a socket in such situations. They would only
use shared memory and direct access to I/O devices via SPDK and DPDK or
the RDMA subsystem.


> > > Blocking? The app should fail if any deferred actions are triggered as a
> > > result of syscalls. It would give a warning with _WARN
> >
> > There are many supposedly innocent things, nowhere at the scale of CPU
> > hotplug, that happen in a system and result in synchronization implemented
> > as an IPI to every online CPU. We should consider them to be an ordinary
> > occurrence, so there is a choice:
> >
> > 1. Ignore them completely and allow them in isolated mode. This will delay
> > userspace with no indication and no isolation breaking.
> >
> > 2. Allow them, and notify userspace afterwards (through vdso or through
> > userspace helper/manager over shared memory). This may be useful in those
> > rare situations when the consequences of delay can be mitigated afterwards.
> >
> > 3. Make them break isolation, with userspace being notified normally (ex:
> > with a signal in the current implementation). I guess, can be used if
> > somehow most of the causes will be eliminated.
> >
> > 4. Prevent them from reaching the target CPU and make sure that whatever
> > synchronization they are intended to cause, will happen when intended target
> > CPU will enter to kernel later. Since we may have to synchronize things like
> > code modification, some of this synchronization has to happen very early on
> > kernel entry.


Or move the actions to a different victim processor like done with rcu and
vmstat etc etc.

> >
> > I am most interested in (4), so this is what was implemented in my version
> > of the patch (and currently I am trying to achieve completeness and, if
> > possible, elegance of the implementation).
>
> Agree. (3) will be necessary as intermediate step. The proposed
> improvement to Christoph's reply, in this thread, separates notification
> and syscall blockage.

I guess the notification mode will take care of the way we handle these
interruptions.

Alex Belits Feb. 1, 2021, 12:47 p.m. UTC | #10

On 2/1/21 02:48, Christoph Lameter wrote:
>> Notifications:
>> -------------
>>
>> Notification mode of isolation breakage can be configured as follows:
>>
>> - None (default): No notification is performed by the kernel on isolation
>>    breakage.
>>
>> - Syslog: Isolation breakage is reported to syslog.

Syslog is intended for humans, and isn't useful for userspace software 
processing. Since there are at least some cases then isolation breaking 
is unavoidable on startup (benign race of isolation entering with 
isolation-breaking event, register-mapping page fault), I would rather 
allow completely automated processing of those events. Signal interface 
does that now, however I think, it would help to associate 
software-handled events with either software-identifiable "cause type" 
(ex: "scheduling timer" or "page fault") or more verbose human-readable 
"cause description" (ex: IPI received, and here is the sender CPU's 
stack dump that led to this IPI being sent).

The former ("cause") may be important for software (for example, it may 
want to have special processing of page faults for device registers), 
while the latter ("description") is more useful when it can be 
associated with particular event in userspace without manual log timing 
comparison and guesswork.

> 
> 
> - Abort with core dump

I would use an existing signal interface for that, with user-defined 
signal. The user can choose to handle the signal, ignore it, let it kill 
the task with or without a core dump.

Oh, and if user wants, he can use ptrace() to delegate this signal to 
some other process.

> 
> This is useful for debugging and for hard core bare metalers that never
> want any interrupts.
> 
> One particular issue are page faults.  One would have to prefault the
> binary executable functions in order to avoid "interruptions" through page
> faults. Are these proper interrutions of the code? Certainly major faults
> are but minor faults may be ok? Dunno.
> 
> In practice what I have often seen in such apps is that there is a "warm"
> up mode where all critical functions are executed, all important variables
> are touched and dummy I/Os are performed in order to populate the caches
> and prefault all the data.I guess one would run these without isolation
> first and then switch on some sort of isolation mode after warm up. So far
> I think most people relied on the timer interrupt etc etc to be turned off
> after a few secs of just running throught a polling loop without any OS
> activities.

This is usually done not as much for page preloading but for cache. 
There is mlock() and mlockall() that load and lock pages explicitly. One 
exception is device registers -- they may remain unmapped until accessed.

I can often see a pattern when application enters isolation, calls 
low-level library such as ODP, gets a page fault, leaves and re-enters 
isolation, and then everything is running perfectly because everything 
is mapped. However in those cases mlockall() is done before entering 
isolation, so regular memory mapping is already there.

> 
>>> I ended up implementing a manager/helper task that talks to tasks over a
>>> socket (when they are not isolated) and over ring buffers in shared memory
>>> (when they are isolated). While the current implementation is rather
>>> limited, the intention is to delegate to it everything that isolated task
>>> either can't do at all (like, writing logs) or that it would be cumbersome
>>> to implement (like monitoring the state of task, determining presence of
>>> deferred work after the task returned to userspace), etc.
>>
>> Interesting. Are you considering opensourcing such library? Seems like a
>> generic problem.

It's already open source, https://github.com/abelits/libtmc

It still needs some work. At the moment it does more than I would prefer 
because it tries to detect possible problems, such as running timers, 
and at the same time does not provide some obviously useful things like 
asynchronous interface to arbitrary file I/O.

I also want to allow the use of some generic interface to triggering 
interrupts from isolated task to the manager (through, say, a sacrifice 
of a single GPIO), so if this option is available, the manager won't 
have to do all that polling.

> 
> Well everyone swears on having the right implementation. The people I know
> would not do any thing with a socket in such situations. They would only
> use shared memory and direct access to I/O devices via SPDK and DPDK or
> the RDMA subsystem.
> 

Same applies to me. My library uses sockets to communicate when the task 
is not isolated, and it will be necessary if we want to have a dedicated 
manager process instead of a manager thread in every process. I would 
prefer initiating a connection with a manager through a socket, and only 
after that succeeds, assume that I can use any particular part of shared 
memory (because it means that manager allocated it for me, and no one 
else will race with me trying to touch it).

> 
>>>> Blocking? The app should fail if any deferred actions are triggered as a
>>>> result of syscalls. It would give a warning with _WARN
>>>
>>> There are many supposedly innocent things, nowhere at the scale of CPU
>>> hotplug, that happen in a system and result in synchronization implemented
>>> as an IPI to every online CPU. We should consider them to be an ordinary
>>> occurrence, so there is a choice:
>>>
>>> 1. Ignore them completely and allow them in isolated mode. This will delay
>>> userspace with no indication and no isolation breaking.
>>>
>>> 2. Allow them, and notify userspace afterwards (through vdso or through
>>> userspace helper/manager over shared memory). This may be useful in those
>>> rare situations when the consequences of delay can be mitigated afterwards.
>>>
>>> 3. Make them break isolation, with userspace being notified normally (ex:
>>> with a signal in the current implementation). I guess, can be used if
>>> somehow most of the causes will be eliminated.
>>>
>>> 4. Prevent them from reaching the target CPU and make sure that whatever
>>> synchronization they are intended to cause, will happen when intended target
>>> CPU will enter to kernel later. Since we may have to synchronize things like
>>> code modification, some of this synchronization has to happen very early on
>>> kernel entry.
> 
> 
> Or move the actions to a different victim processor like done with rcu and
> vmstat etc etc.

If possible. For most of those things everything can be moved to other 
CPUs when entering for isolation, or not allowed on CPUs intended for 
isolation in the first place (how it's mostly done now). The troublesome 
sources of interruption are things that are legitimately supposed to be 
done on all CPUs at once to synchronize some important kind of state, 
and now we want to delay them on some CPUs until the end of isolation.

>>>
>>> I am most interested in (4), so this is what was implemented in my version
>>> of the patch (and currently I am trying to achieve completeness and, if
>>> possible, elegance of the implementation).
>>
>> Agree. (3) will be necessary as intermediate step. The proposed
>> improvement to Christoph's reply, in this thread, separates notification
>> and syscall blockage.
> 
> I guess the notification mode will take care of the way we handle these
> interruptions.
> 

I think, development should go in parallel -- to have a "delayed 
synchronization on entry" mechanism that allows "no-interruption mode" 
(4) to work given that all interruptions are dealt with (that won't work 
perfectly at first because there are still "unprocessed" sources of 
interruptions) and a notification mechanism that will allow us to find 
and properly process them as (3), so we can exclude them and allow (4). 
Since (4) still requires somewhat intrusive architecture-specific 
changes, there may be some time when (4) will be only available on some 
CPUs, but (3) will work on everything.

Marcelo Tosatti Feb. 1, 2021, 6:20 p.m. UTC | #11

On Mon, Feb 01, 2021 at 10:48:18AM +0000, Christoph Lameter wrote:
> On Thu, 21 Jan 2021, Marcelo Tosatti wrote:
> 
> > Anyway, trying to improve Christoph's definition:
> >
> > F_ISOL_QUIESCE                -> flush any pending operations that might cause
> > 				 the CPU to be interrupted (ex: free's
> > 				 per-CPU queues, sync MM statistics
> > 				 counters, etc).
> >
> > F_ISOL_ISOLATE		      -> inform the kernel that userspace is
> > 				 entering isolated mode (see description
> > 				 below on "ISOLATION MODES").
> >
> > F_ISOL_UNISOLATE              -> inform the kernel that userspace is
> > 				 leaving isolated mode.
> >
> > F_ISOL_NOTIFY		      -> notification mode of isolation breakage
> > 				 modes.
> 
> Looks good to me.
> 
> 
> > Isolation modes:
> > ---------------
> >
> > There are two main types of isolation modes:
> >
> > - SOFT mode: does not prevent activities which might generate interruptions
> > (such as CPU hotplug).
> >
> > - HARD mode: prevents all blockable activities that might generate interruptions.
> > Administrators can override this via /sys.
> 
> 
> Yup.
> 
> >
> > Notifications:
> > -------------
> >
> > Notification mode of isolation breakage can be configured as follows:
> >
> > - None (default): No notification is performed by the kernel on isolation
> >   breakage.
> >
> > - Syslog: Isolation breakage is reported to syslog.
> 
> 
> - Abort with core dump
> 
> This is useful for debugging and for hard core bare metalers that never
> want any interrupts.
> 
> One particular issue are page faults.  One would have to prefault the
> binary executable functions in order to avoid "interruptions" through page
> faults. Are these proper interrutions of the code? Certainly major faults
> are but minor faults may be ok? Dunno.

mlockall man page:

Real-time processes that are using mlockall() to prevent delays on
page faults should reserve enough locked stack pages before entering
the time-critical section, so that no page fault can be caused
by function calls. This can be achieved by calling a function that
allocates a sufficiently large automatic variable (an array) and writes
to the memory occupied by this array in order to touch these stack
pages. This way, enough pages will be mapped for the stack and can be
locked into RAM. The dummy writes ensure that not even copy-on-write
page faults can occur in the critical section.

> In practice what I have often seen in such apps is that there is a "warm"
> up mode where all critical functions are executed, all important variables
> are touched and dummy I/Os are performed in order to populate the caches
> and prefault all the data.I guess one would run these without isolation
> first and then switch on some sort of isolation mode after warm up. So far
> I think most people relied on the timer interrupt etc etc to be turned off
> after a few secs of just running throught a polling loop without any OS
> activities.

Yep.

> > > I ended up implementing a manager/helper task that talks to tasks over a
> > > socket (when they are not isolated) and over ring buffers in shared memory
> > > (when they are isolated). While the current implementation is rather
> > > limited, the intention is to delegate to it everything that isolated task
> > > either can't do at all (like, writing logs) or that it would be cumbersome
> > > to implement (like monitoring the state of task, determining presence of
> > > deferred work after the task returned to userspace), etc.
> >
> > Interesting. Are you considering opensourcing such library? Seems like a
> > generic problem.
> 
> Well everyone swears on having the right implementation. The people I know
> would not do any thing with a socket in such situations. They would only
> use shared memory and direct access to I/O devices via SPDK and DPDK or
> the RDMA subsystem.
> 
> 
> > > > Blocking? The app should fail if any deferred actions are triggered as a
> > > > result of syscalls. It would give a warning with _WARN
> > >
> > > There are many supposedly innocent things, nowhere at the scale of CPU
> > > hotplug, that happen in a system and result in synchronization implemented
> > > as an IPI to every online CPU. We should consider them to be an ordinary
> > > occurrence, so there is a choice:
> > >
> > > 1. Ignore them completely and allow them in isolated mode. This will delay
> > > userspace with no indication and no isolation breaking.
> > >
> > > 2. Allow them, and notify userspace afterwards (through vdso or through
> > > userspace helper/manager over shared memory). This may be useful in those
> > > rare situations when the consequences of delay can be mitigated afterwards.
> > >
> > > 3. Make them break isolation, with userspace being notified normally (ex:
> > > with a signal in the current implementation). I guess, can be used if
> > > somehow most of the causes will be eliminated.
> > >
> > > 4. Prevent them from reaching the target CPU and make sure that whatever
> > > synchronization they are intended to cause, will happen when intended target
> > > CPU will enter to kernel later. Since we may have to synchronize things like
> > > code modification, some of this synchronization has to happen very early on
> > > kernel entry.
> 
> 
> Or move the actions to a different victim processor like done with rcu and
> vmstat etc etc.
> 
> > >
> > > I am most interested in (4), so this is what was implemented in my version
> > > of the patch (and currently I am trying to achieve completeness and, if
> > > possible, elegance of the implementation).
> >
> > Agree. (3) will be necessary as intermediate step. The proposed
> > improvement to Christoph's reply, in this thread, separates notification
> > and syscall blockage.
> 
> I guess the notification mode will take care of the way we handle these
> interruptions.

[RFC] tentative prctl task isolation interface

Commit Message

Comments

Patch