[v2,0/8] Fork brute force attack mitigation

Message ID: 20201025134540.3770-1-john.wood@gmx.com

Message

John Wood Oct. 25, 2020, 1:45 p.m. UTC
Attacks against vulnerable userspace applications that aim to break ASLR
or bypass canaries traditionally use some level of brute force, aided by
the fork system call. This is possible because a process created with
fork has the same memory contents as its parent (the process that called
fork). So, the attacker can probe the memory an unlimited number of times
to find the correct memory values or addresses without worrying about
crashing the application.

Given this scenario, it would be useful to detect and mitigate such
attacks, and that is the goal of this patch series.

Other implementations
---------------------

In summary, the public version of grsecurity is based on the idea of
delaying the fork system call if a child died due to a fatal error. This
approach has some issues:

1.- Bad practice: Adding delays to the kernel is, in general, a bad idea.

2.- Weak points: This protection can be bypassed using two different
    methods, since it acts only when fork is called after a child has
    crashed.

    2.1.- Bypass 1: It would still be possible for an attacker to fork a
          large number of children (on the order of thousands), probe all
          of them, and finally wait out the protection time before
          repeating the steps.

    2.2.- Bypass 2: This method exploits the fact that the protection
          doesn't act if the parent crashes. So, it would still be
          possible for an attacker to fork a process that probes itself,
          then have the child fork and probe itself again. These steps
          can be repeated indefinitely without triggering any mitigation
          (a sketch follows this list).
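
A minimal userspace sketch of this second bypass pattern (illustrative
only; the probe is modeled with abort()):

    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Stand-in for one brute force attempt: a wrong guess crashes the
     * calling process. */
    static void probe_guess(void)
    {
            abort();
    }

    int main(void)
    {
            for (;;) {
                    pid_t pid = fork();

                    if (pid == 0)
                            continue;   /* child: survives and forks the
                                         * next prober */
                    probe_guess();      /* parent: probes itself; only
                                         * parents ever crash, which a
                                         * delay-on-fork check ignores */
            }
    }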

This implementation
-------------------

The main idea behind this implementation is to improve on the existing
ones by focusing on the weak points noted above. The solution for the
first bypass method is to detect a fast crash rate instead of a single
crash. For the second bypass method, the solution is to detect crashes of
both parent and child processes. Moreover, as a mitigation it is better
to kill all the offending tasks involved in the attack than to use
delays.

So, the solution to the two bypass methods described above is to use
statistical data shared across all the processes that can have the same
memory contents; in other words, statistical data shared among all the
processes in a fork hierarchy after an execve system call.
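
Conceptually, each fork hierarchy started by an execve shares a single
statistics blob, along these lines (a sketch only; the field names are
illustrative, not taken verbatim from the patches):

    struct brute_stats {
            spinlock_t lock;   /* serializes updates to this blob */
            refcount_t refc;   /* one reference per task sharing it */
            u64 *timestamps;   /* last crash timestamps, sysctl-sized */
    };

fork() shares the parent's blob (taking a reference), while execve
allocates a fresh one, so every post-execve hierarchy is tracked as a
unit.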

The purpose of these statistics is to compute the application crash period
in order to detect an attack. This crash period is the time between the
execve system call and the first fault, or the time between two
consecutive faults, but this has a drawback: if an application crashes
once shortly after the execve system call, or crashes twice in a short
period of time for some reason, a false positive will be triggered. To
avoid this scenario, the shared statistical data holds a list of the
timestamps of the last i crashes, and the application crash period is
computed as follows:

crash_period = (n_last_timestamp - n_minus_i_timestamp) / i;

This way, the size of the crash timestamps list allows fine-tuning of the
detection sensitivity.
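
For illustration, the computation over a filled timestamps list could
look like this (a kernel-style sketch, not the exact code from the
patches):

    /*
     * ts[] holds the timestamps of the last crashes, oldest first and
     * newest last; size - 1 inter-crash intervals separate ts[0] from
     * ts[size - 1], mirroring the formula above.
     */
    static u64 brute_crash_period(const u64 *ts, size_t size)
    {
            return (ts[size - 1] - ts[0]) / (size - 1);
    }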

When this crash period falls below a certain threshold, there is a clear
signal that something malicious is happening. Once an attack is detected,
the mitigation kills only the processes that share the same statistical
data, and so all the tasks that can have the same memory contents. This
way, the attack is stopped.

1.- Per-system enabling: This feature can be enabled at build time using
    the CONFIG_SECURITY_FORK_BRUTE option or via the kernel configuration
    menu:

    Security options ---> Fork brute force attack detection and mitigation

2.- Per-process enabling/disabling: To allow specific applications to turn
    the detection and mitigation of a fork brute force attack on or off
    when required, there are two new prctls.

    prctl(PR_SECURITY_FORK_BRUTE_ENABLE, 0, 0, 0, 0)
    prctl(PR_SECURITY_FORK_BRUTE_DISABLE, 0, 0, 0, 0)
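
    As a sketch, an application that legitimately forks and kills children
    in quick succession could bracket that phase as follows (the wrapper
    and its callback are hypothetical, not part of this series):

        #include <sys/prctl.h>
        #include <linux/prctl.h> /* PR_SECURITY_FORK_BRUTE_* (this series) */

        static void with_brute_disabled(void (*phase)(void))
        {
                prctl(PR_SECURITY_FORK_BRUTE_DISABLE, 0, 0, 0, 0);
                phase();        /* crash-prone but legitimate work */
                prctl(PR_SECURITY_FORK_BRUTE_ENABLE, 0, 0, 0, 0);
        }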

3.- Fine tuning: To customize the detection sensitivity there are two new
    sysctl attributes that set the crash timestamps list size and the
    application crash period threshold (in milliseconds). They are
    accessible through the following files, respectively.

    /proc/sys/kernel/brute/timestamps_list_size
    /proc/sys/kernel/brute/crash_period_threshold

    The list size helps avoid false positives due to crashes unrelated to
    a real attack. The period threshold sets the time limit for detecting
    an attack. Since a fork brute force attack is detected when the
    application crash period falls below this threshold, the higher this
    value, the more sensitive the detection.
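
    For example, a sketch of widening the sampling window from userspace
    (the helper and the value passed to it are illustrative):

        #include <stdio.h>

        /* Write a new list size to the brute sysctl; needs root. */
        static int set_brute_list_size(unsigned int size)
        {
                FILE *f = fopen("/proc/sys/kernel/brute/timestamps_list_size",
                                "w");

                if (!f)
                        return -1;
                fprintf(f, "%u\n", size);
                return fclose(f);
        }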

With all this in mind, the patches are as follows:

The 1/8 patch defines a new LSM hook to get the fatal signal of a task.
This will be useful during the attack detection phase.
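
As a sketch, the hook would look something like this (the exact signature
here is illustrative, not copied from the patch):

    /* Called from the signal delivery path when a task is about to be
     * killed by a fatal signal. */
    void security_task_fatal_signal(struct task_struct *task);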

The 2/8 patch defines a new LSM and manages the statistical data shared by
all the fork hierarchy processes.

The 3/8 patch adds the sysctl attributes to fine-tune the detection.

Patches 4/8 and 5/8 detect and mitigate a fork brute force attack.

Patch 6/8 adds the prctls to allow per process enabling/disabling.

Patch 7/8 adds the documentation to explain this implementation.

Patch 8/8 updates the maintainers file.

This patch series is a task of the KSPP [1] and can also be accessed from
my GitHub tree [2] in the "brute_v2" branch.

[1] https://github.com/KSPP/linux/issues/39
[2] https://github.com/johwood/linux/

The first version can be found at:

https://lore.kernel.org/kernel-hardening/20200910202107.3799376-1-keescook@chromium.org/

Changelog RFC -> v2
-------------------
- Give this feature a more appropriate name (Jann Horn, Kees Cook).
- Convert the code to an LSM (Kees Cook).
- Add locking to avoid data races (Jann Horn).
- Add a new LSM hook to get the fatal signal of a task (Jann Horn, Kees
  Cook).
- Add the last crashes timestamps list to avoid false positives in the
  attack detection (Jann Horn).
- Use "period" instead of "rate" (Jann Horn).
- Other minor changes suggested (Jann Horn, Kees Cook).

John Wood (8):
  security: Add LSM hook at the point where a task gets a fatal signal
  security/brute: Define a LSM and manage statistical data
  security/brute: Add sysctl attributes to allow detection fine tuning
  security/brute: Detect a fork brute force attack
  security/brute: Mitigate a fork brute force attack
  security/brute: Add prctls to enable/disable the fork attack detection
  Documentation: Add documentation for the Brute LSM
  MAINTAINERS: Add a new entry for the Brute LSM

 Documentation/admin-guide/LSM/Brute.rst | 118 ++++
 Documentation/admin-guide/LSM/index.rst |   1 +
 MAINTAINERS                             |   7 +
 include/brute/brute.h                   |  16 +
 include/linux/lsm_hook_defs.h           |   1 +
 include/linux/lsm_hooks.h               |   4 +
 include/linux/security.h                |   4 +
 include/uapi/linux/prctl.h              |   4 +
 kernel/signal.c                         |   1 +
 kernel/sys.c                            |   8 +
 security/Kconfig                        |  11 +-
 security/Makefile                       |   4 +
 security/brute/Kconfig                  |  13 +
 security/brute/Makefile                 |   2 +
 security/brute/brute.c                  | 749 ++++++++++++++++++++++++
 security/security.c                     |   5 +
 16 files changed, 943 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/admin-guide/LSM/Brute.rst
 create mode 100644 include/brute/brute.h
 create mode 100644 security/brute/Kconfig
 create mode 100644 security/brute/Makefile
 create mode 100644 security/brute/brute.c

--
2.25.1

Comments

John Wood Nov. 6, 2020, 3:54 p.m. UTC | #1
Hi,

Has anyone had time to review this patch series? Any comments on this?

Regards.

On Sun, Oct 25, 2020 at 02:45:32PM +0100, John Wood wrote:
> Attacks against vulnerable userspace applications that aim to break ASLR
> or bypass canaries traditionally use some level of brute force, aided by
> the fork system call. [...]
>
> Given this scenario, it would be useful to detect and mitigate such
> attacks, and that is the goal of this patch series.
>
> [...]
Kees Cook Nov. 11, 2020, 12:10 a.m. UTC | #2
On Sun, Oct 25, 2020 at 02:45:32PM +0100, John Wood wrote:
> Attacks against vulnerable userspace applications that aim to break ASLR
> or bypass canaries traditionally use some level of brute force, aided by
> the fork system call. This is possible because a process created with
> fork has the same memory contents as its parent (the process that called
> fork). So, the attacker can probe the memory an unlimited number of times
> to find the correct memory values or addresses without worrying about
> crashing the application.
> 
> Given this scenario, it would be useful to detect and mitigate such
> attacks, and that is the goal of this patch series.

Thanks for preparing the v2! I spent some time looking at this today,
and I really like how it has been rearranged into an LSM. This feels
much more natural.

Various notes:

The locking isn't right; it'll trip with CONFIG_PROVE_LOCKING=y.
Here's the giant splat:


[    8.205146] brute: Fork brute force attack detected
[    8.206821] 
[    8.207317] =====================================================
[    8.209392] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[    8.211852] 5.10.0-rc3 #2 Not tainted
[    8.213215] -----------------------------------------------------
[    8.215213] runc/2505 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[    8.217387] ffffffff97206098 (tasklist_lock){.+.+}-{2:2}, at: brute_task_fatal_signal.cold+0x49/0xe0
[    8.219554] 
[    8.219554] and this task is already holding:
[    8.220891] ffff9a6a10eb8318 (&stats->lock){+.-.}-{2:2}, at: brute_task_fatal_signal+0x97/0x210
[    8.222856] which would create a new lock dependency:
[    8.223943]  (&stats->lock){+.-.}-{2:2} -> (tasklist_lock){.+.+}-{2:2}
[    8.225360] 
[    8.225360] but this new dependency connects a SOFTIRQ-irq-safe lock:
[    8.227081]  (&stats->lock){+.-.}-{2:2}
[    8.227084] 
[    8.227084] ... which became SOFTIRQ-irq-safe at:
[    8.228732]   lock_acquire+0x13f/0x3f0
[    8.229381]   _raw_spin_lock+0x2c/0x40
[    8.230024]   brute_task_free+0x22/0x90
[    8.230619]   security_task_free+0x22/0x50
[    8.231260]   __put_task_struct+0x58/0x140
[    8.231924]   rcu_core+0x2bb/0x620
[    8.232492]   __do_softirq+0x156/0x4d9
[    8.233080]   asm_call_irq_on_stack+0x12/0x20
[    8.233759]   do_softirq_own_stack+0x5b/0x70
[    8.234425]   irq_exit_rcu+0x9f/0xe0
[    8.235221]   sysvec_apic_timer_interrupt+0x43/0xa0
[    8.235976]   asm_sysvec_apic_timer_interrupt+0x12/0x20
[    8.236794]   _raw_write_unlock_irq+0x2c/0x40
[    8.237465]   copy_process+0x15e6/0x1cb0
[    8.237966]   kernel_clone+0x9b/0x3f0
[    8.238413]   kernel_thread+0x55/0x70
[    8.238861]   call_usermodehelper_exec_work+0x77/0xb0
[    8.239466]   process_one_work+0x23e/0x580
[    8.239969]   worker_thread+0x55/0x3c0
[    8.240425]   kthread+0x141/0x160
[    8.240833]   ret_from_fork+0x22/0x30
[    8.241276] 
[    8.241276] to a SOFTIRQ-irq-unsafe lock:
[    8.241939]  (tasklist_lock){.+.+}-{2:2}
[    8.241940] 
[    8.241940] ... which became SOFTIRQ-irq-unsafe at:
[    8.243180] ...
[    8.243182]   lock_acquire+0x13f/0x3f0
[    8.243853]   _raw_read_lock+0x5d/0x70
[    8.244307]   do_wait+0xd2/0x2e0
[    8.244706]   kernel_wait+0x49/0x90
[    8.245126]   call_usermodehelper_exec_work+0x61/0xb0
[    8.245748]   process_one_work+0x23e/0x580
[    8.246238]   worker_thread+0x55/0x3c0
[    8.246685]   kthread+0x141/0x160
[    8.247086]   ret_from_fork+0x22/0x30
[    8.247522] 
[    8.247522] other info that might help us debug this:
[    8.247522] 
[    8.248480]  Possible interrupt unsafe locking scenario:
[    8.248480] 
[    8.249289]        CPU0                    CPU1
[    8.249839]        ----                    ----
[    8.250386]   lock(tasklist_lock);
[    8.250811]                                local_irq_disable();
[    8.251525]                                lock(&stats->lock);
[    8.252227]                                lock(tasklist_lock);
[    8.252938]   <Interrupt>
[    8.253249]     lock(&stats->lock);
[    8.253663] 
[    8.253663]  *** DEADLOCK ***
[    8.253663] 
[    8.254362] 1 lock held by runc/2505:
[    8.254800]  #0: ffff9a6a10eb8318 (&stats->lock){+.-.}-{2:2}, at: brute_task_fatal_signal+0x97/0x210
[    8.255872] 
[    8.255872] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[    8.256931] -> (&stats->lock){+.-.}-{2:2} {
[    8.257432]    HARDIRQ-ON-W at:
[    8.257809]                     lock_acquire+0x13f/0x3f0
[    8.258442]                     _raw_spin_lock+0x2c/0x40
[    8.259069]                     brute_task_free+0x22/0x90
[    8.259710]                     security_task_free+0x22/0x50
[    8.260384]                     __put_task_struct+0x58/0x140
[    8.261078]                     rcu_core+0x2bb/0x620
[    8.261677]                     __do_softirq+0x156/0x4d9
[    8.262304]                     asm_call_irq_on_stack+0x12/0x20
[    8.263010]                     do_softirq_own_stack+0x5b/0x70
[    8.263699]                     irq_exit_rcu+0x9f/0xe0
[    8.264315]                     sysvec_apic_timer_interrupt+0x43/0xa0
[    8.265085]                     asm_sysvec_apic_timer_interrupt+0x12/0x20
[    8.265886]                     _raw_write_unlock_irq+0x2c/0x40
[    8.266580]                     copy_process+0x15e6/0x1cb0
[    8.267218]                     kernel_clone+0x9b/0x3f0
[    8.267832]                     kernel_thread+0x55/0x70
[    8.268444]                     call_usermodehelper_exec_work+0x77/0xb0
[    8.269234]                     process_one_work+0x23e/0x580
[    8.269912]                     worker_thread+0x55/0x3c0
[    8.270546]                     kthread+0x141/0x160
[    8.271144]                     ret_from_fork+0x22/0x30
[    8.271771]    IN-SOFTIRQ-W at:
[    8.272146]                     lock_acquire+0x13f/0x3f0
[    8.272780]                     _raw_spin_lock+0x2c/0x40
[    8.273398]                     brute_task_free+0x22/0x90
[    8.274022]                     security_task_free+0x22/0x50
[    8.274690]                     __put_task_struct+0x58/0x140
[    8.275358]                     rcu_core+0x2bb/0x620
[    8.275937]                     __do_softirq+0x156/0x4d9
[    8.276559]                     asm_call_irq_on_stack+0x12/0x20
[    8.277262]                     do_softirq_own_stack+0x5b/0x70
[    8.277948]                     irq_exit_rcu+0x9f/0xe0
[    8.278561]                     sysvec_apic_timer_interrupt+0x43/0xa0
[    8.279324]                     asm_sysvec_apic_timer_interrupt+0x12/0x20
[    8.280121]                     _raw_write_unlock_irq+0x2c/0x40
[    8.280845]                     copy_process+0x15e6/0x1cb0
[    8.281491]                     kernel_clone+0x9b/0x3f0
[    8.282117]                     kernel_thread+0x55/0x70
[    8.282734]                     call_usermodehelper_exec_work+0x77/0xb0
[    8.283514]                     process_one_work+0x23e/0x580
[    8.284181]                     worker_thread+0x55/0x3c0
[    8.284802]                     kthread+0x141/0x160
[    8.285381]                     ret_from_fork+0x22/0x30
[    8.285998]    INITIAL USE at:
[    8.286365]                    lock_acquire+0x13f/0x3f0
[    8.286975]                    _raw_spin_lock_irqsave+0x3b/0x60
[    8.287665]                    brute_share_stats+0x17/0x70
[    8.288311]                    brute_task_alloc+0x65/0x70
[    8.288948]                    security_task_alloc+0x44/0xd0
[    8.289605]                    copy_process+0x789/0x1cb0
[    8.290223]                    kernel_clone+0x9b/0x3f0
[    8.290819]                    kernel_thread+0x55/0x70
[    8.291421]                    rest_init+0x21/0x258
[    8.291995]                    start_kernel+0x566/0x587
[    8.292611]                    secondary_startup_64_no_verify+0xc2/0xcb
[    8.293384]  }
[    8.293586]  ... key      at: [<ffffffff984ce3a0>] __key.0+0x0/0x10
[    8.294324]  ... acquired at:
[    8.294679]    lock_acquire+0x13f/0x3f0
[    8.295133]    _raw_read_lock+0x5d/0x70
[    8.295589]    brute_task_fatal_signal.cold+0x49/0xe0
[    8.296185]    security_task_fatal_signal+0x22/0x30
[    8.296772]    get_signal+0x3e0/0xcf0
[    8.297212]    arch_do_signal+0x30/0x880
[    8.297682]    exit_to_user_mode_prepare+0xfc/0x170
[    8.298263]    syscall_exit_to_user_mode+0x38/0x240
[    8.298844]    entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.299464] 
[    8.299648] 
[    8.299648] the dependencies between the lock to be acquired
[    8.299648]  and SOFTIRQ-irq-unsafe lock:
[    8.300969] -> (tasklist_lock){.+.+}-{2:2} {
[    8.301481]    HARDIRQ-ON-R at:
[    8.301857]                     lock_acquire+0x13f/0x3f0
[    8.302482]                     _raw_read_lock+0x5d/0x70
[    8.303110]                     do_wait+0xd2/0x2e0
[    8.303677]                     kernel_wait+0x49/0x90
[    8.304266]                     call_usermodehelper_exec_work+0x61/0xb0
[    8.305038]                     process_one_work+0x23e/0x580
[    8.305700]                     worker_thread+0x55/0x3c0
[    8.306320]                     kthread+0x141/0x160
[    8.306895]                     ret_from_fork+0x22/0x30
[    8.307506]    SOFTIRQ-ON-R at:
[    8.307880]                     lock_acquire+0x13f/0x3f0
[    8.308505]                     _raw_read_lock+0x5d/0x70
[    8.309125]                     do_wait+0xd2/0x2e0
[    8.309689]                     kernel_wait+0x49/0x90
[    8.310272]                     call_usermodehelper_exec_work+0x61/0xb0
[    8.311044]                     process_one_work+0x23e/0x580
[    8.311709]                     worker_thread+0x55/0x3c0
[    8.312335]                     kthread+0x141/0x160
[    8.312910]                     ret_from_fork+0x22/0x30
[    8.313529]    INITIAL USE at:
[    8.313897]                    lock_acquire+0x13f/0x3f0
[    8.314515]                    _raw_write_lock_irq+0x34/0x50
[    8.315187]                    copy_process+0x11d5/0x1cb0
[    8.315834]                    kernel_clone+0x9b/0x3f0
[    8.316446]                    kernel_thread+0x55/0x70
[    8.317057]                    rest_init+0x21/0x258
[    8.317636]                    start_kernel+0x566/0x587
[    8.318250]                    secondary_startup_64_no_verify+0xc2/0xcb
[    8.319033]    INITIAL READ USE at:
[    8.319452]                         lock_acquire+0x13f/0x3f0
[    8.320117]                         _raw_read_lock+0x5d/0x70
[    8.320777]                         do_wait+0xd2/0x2e0
[    8.321387]                         kernel_wait+0x49/0x90
[    8.322031]                         call_usermodehelper_exec_work+0x61/0xb0
[    8.322852]                         process_one_work+0x23e/0x580
[    8.323557]                         worker_thread+0x55/0x3c0
[    8.324220]                         kthread+0x141/0x160
[    8.324837]                         ret_from_fork+0x22/0x30
[    8.325490]  }
[    8.325694]  ... key      at: [<ffffffff97206098>] tasklist_lock+0x18/0x40
[    8.326509]  ... acquired at:
[    8.326871]    lock_acquire+0x13f/0x3f0
[    8.327330]    _raw_read_lock+0x5d/0x70
[    8.327790]    brute_task_fatal_signal.cold+0x49/0xe0
[    8.328392]    security_task_fatal_signal+0x22/0x30
[    8.328958]    get_signal+0x3e0/0xcf0
[    8.329388]    arch_do_signal+0x30/0x880
[    8.329855]    exit_to_user_mode_prepare+0xfc/0x170
[    8.330434]    syscall_exit_to_user_mode+0x38/0x240
[    8.331010]    entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.331622] 
[    8.331801] 
[    8.331801] stack backtrace:
[    8.332322] CPU: 2 PID: 2505 Comm: runc Not tainted 5.10.0-rc3 #2
[    8.333037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[    8.334088] Call Trace:
[    8.334385]  dump_stack+0x77/0x97
[    8.334779]  check_irq_usage.cold+0x279/0x2ec
[    8.335292]  ? check_noncircular+0x75/0x110
[    8.335783]  __lock_acquire+0x12e0/0x2590
[    8.336256]  lock_acquire+0x13f/0x3f0
[    8.336696]  ? brute_task_fatal_signal.cold+0x49/0xe0
[    8.337284]  _raw_read_lock+0x5d/0x70
[    8.337721]  ? brute_task_fatal_signal.cold+0x49/0xe0
[    8.338315]  brute_task_fatal_signal.cold+0x49/0xe0
[    8.338885]  security_task_fatal_signal+0x22/0x30
[    8.339435]  get_signal+0x3e0/0xcf0
[    8.339849]  arch_do_signal+0x30/0x880
[    8.340290]  ? rcu_read_lock_sched_held+0x3f/0x70
[    8.340853]  ? kfree+0x25d/0x2c0
[    8.341241]  exit_to_user_mode_prepare+0xfc/0x170
[    8.341796]  syscall_exit_to_user_mode+0x38/0x240
[    8.342350]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    8.342945] RIP: 0033:0x461923
[    8.343310] Code: 24 20 c3 cc cc cc cc 48 8b 7c 24 08 8b 74 24 10 8b 54 24 14 4c 8b 54 24 18 4c 8b 44 24 20 44 8b 4c 24 28 b8 ca 00 00 00 0f 05 <89> 44 24 30 c3 cc cc cc cc cc cc cc cc 8b 7c 24 08 48 8b 74 24 10
[    8.345488] RSP: 002b:00007fd62affcda8 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[    8.346385] RAX: fffffffffffffe00 RBX: 000000c000032e00 RCX: 0000000000461923
[    8.347225] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000000000b3b898
[    8.348071] RBP: 00007fd62affcdf0 R08: 0000000000000000 R09: 0000000000000000
[    8.349006] R10: 0000000000000000 R11: 0000000000000286 R12: 000000c000001680
[    8.349838] R13: 00007ffd85b25fcf R14: 00007ffd85b260d0 R15: 00007fd62affcfc0

I think it should be possible to use existing task locking semantics
to manage the statistics, but I'll need to take a closer look.

> Other implementations
> ---------------------
> 
> In summary, the public version of grsecurity is based on the idea of
> delaying the fork system call if a child died due to a fatal error. This
> approach has some issues:
> 
> 1.- Bad practice: Adding delays to the kernel is, in general, a bad idea.
> 
> 2.- Weak points: This protection can be bypassed using two different
>     methods, since it acts only when fork is called after a child has
>     crashed.
> 
>     2.1.- Bypass 1: It would still be possible for an attacker to fork a
>           large number of children (on the order of thousands), probe all
>           of them, and finally wait out the protection time before
>           repeating the steps.
> 
>     2.2.- Bypass 2: This method exploits the fact that the protection
>           doesn't act if the parent crashes. So, it would still be
>           possible for an attacker to fork a process that probes itself,
>           then have the child fork and probe itself again. These steps
>           can be repeated indefinitely without triggering any mitigation.

It's good to clarify what the expected behaviors should be; however,
while working with the resulting system, it wasn't clear what the threat
model was for this defense. I think we need two things: clear
descriptions of what is expected to be detected (and what is not), and a
set of self-tests that can be used to validate those expectations.

Also, what, specifically, does "fatal error" cover? Is it strictly fatal
signals? (i.e. "error" might refer to exit code, for example.)

> 
> This implementation
> -------------------
> 
> The main idea behind this implementation is to improve on the existing
> ones by focusing on the weak points noted above. The solution for the
> first bypass method is to detect a fast crash rate instead of a single
> crash. For the second bypass method, the solution is to detect crashes
> of both parent and child processes. Moreover, as a mitigation it is
> better to kill all the offending tasks involved in the attack than to
> use delays.
> 
> So, the solution to the two bypass methods described above is to use
> statistical data shared across all the processes that can have the same
> memory contents; in other words, statistical data shared among all the
> processes in a fork hierarchy after an execve system call.

Hm, is this already tracked in some other way? i.e. the family hierarchy
of the mm struct? They're only shared on clone, but get totally copied on
fork(). Should that be the place where this is tracked instead? (i.e. I
could fork and totally rearrange my VMAs.)

> 
> The purpose of these statistics is to compute the application crash
> period in order to detect an attack. This crash period is the time
> between the execve system call and the first fault, or the time between
> two consecutive faults, but this has a drawback: if an application
> crashes once shortly after the execve system call, or crashes twice in
> a short period of time for some reason, a false positive will be
> triggered. To avoid this scenario, the shared statistical data holds a
> list of the timestamps of the last i crashes, and the application crash
> period is computed as follows:
> 
> crash_period = (n_last_timestamp - n_minus_i_timestamp) / i;
> 
> This way, the size of the crash timestamps list allows fine-tuning of
> the detection sensitivity.

Instead of a list, can't the rate just be calculated on an on-going
basis?

> When this crash period falls below a certain threshold, there is a clear
> signal that something malicious is happening. Once an attack is detected,
> the mitigation kills only the processes that share the same statistical
> data, and so all the tasks that can have the same memory contents. This
> way, the attack is stopped.

Here's where I think the threat model needs some more work. The above
describes what I think is a less common situation. I expect the common
attack to hold still with a single value, and let the fork/exec spin
until the value lines up. (i.e. a fork is required.)

Here are the threat scenarios that come to mind for me:

1- launching (fork/exec) a setuid process repeatedly until you get a
   desirable memory layout (e.g. what Stack Clash[1] did).

2- connecting to an exec()ing network daemon (e.g. xinetd) repeatedly
   until you get a desirable memory layout (e.g. what CTFs do for simple
   network services[2]).

3- launching processes _without_ exec (e.g. Android Zygote[3]), and
   exposing state to attack a sibling.

4- connecting to a fork()ing network daemon (e.g. apache) repeatedly
   until you expose the previously-shared memory layout of all the other
   children (e.g. kind of related to HeartBleed[4], though that was a
   direct exposure not a crasher).

In each case, a privilege boundary is being crossed (setuid in the first,
priv-changes in the 2nd, and network-to-local in the latter two), so I
suspect that kind of detail will need to play a part in the design to
help avoid false positives.

Regardless, when I tested this series, scenarios 1 and 3 aren't detected, since
they pass through an execve(), and I think that needs to be covered as
well.

[1] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
[2] https://github.com/BSidesPDX/CTF-2017/blob/master/pwn/200-leek/src/leek.service
[3] https://link.springer.com/article/10.1007/s10207-018-00425-8
[4] https://heartbleed.com/

> 
> 1.- Per system enabling: This feature can be enabled at build time using
>     the CONFIG_SECURITY_FORK_BRUTE option or using the visual config
>     application under the following menu:
> 
>     Security options ---> Fork brute force attack detection and mitigation

(there is a built-in boot-time disable too, by changing the lsm=
bootparam)

> 
> 2.- Per process enabling/disabling: To allow that specific applications can
>     turn off or turn on the detection and mitigation of a fork brute force
>     attack when required, there are two new prctls.
> 
>     prctl(PR_SECURITY_FORK_BRUTE_ENABLE, 0, 0, 0, 0)
>     prctl(PR_SECURITY_FORK_BRUTE_DISABLE, 0, 0, 0, 0)

How do you see this being used?

> 3.- Fine tuning: To customize the detection's sensibility there are two new
>     sysctl attributes that allow to set the last crashes timestamps list
>     size and the application crash period threshold (in milliseconds). Both
>     are accessible through the following files respectively.
> 
>     /proc/sys/kernel/brute/timestamps_list_size
>     /proc/sys/kernel/brute/crash_period_threshold
> 
>     The list size allows to avoid false positives due to crashes unrelated
>     with a real attack. The period threshold sets the time limit to detect
>     an attack. And, since a fork brute force attack will be detected if the
>     application crash period falls under this threshold, the higher this
>     value, the more sensitive the detection will be.

I wonder if these will be needed as we narrow in on the specific threat
model (i.e. there will be enough additional signal to obviate this
tuning).

And in testing I found another false positive that I haven't fully
diagnosed. I found that at boot, with Docker installed, "runc" would
immediately trip the mitigation. With some debugging added, it looks like
runc had several forked processes that got SIGKILLed in quick succession,
and then the entire group got killed by Brute. I haven't narrowed down
what runc is doing here, but it makes me wonder if there might need
to be an exception for user-space delivered signals, as opposed to
kernel-delivered signals...

Thanks again for the work! I'm liking the idea of getting a solid
protection for this. It's been a long-standing hole in upstream. :)

-Kees
John Wood Nov. 15, 2020, 3 p.m. UTC | #3
On Tue, Nov 10, 2020 at 04:10:49PM -0800, Kees Cook wrote:
> On Sun, Oct 25, 2020 at 02:45:32PM +0100, John Wood wrote:
> > Attacks against vulnerable userspace applications that aim to break
> > ASLR or bypass canaries traditionally use some level of brute force,
> > aided by the fork system call. [...]
> >
> > Given this scenario, it would be useful to detect and mitigate such
> > attacks, and that is the goal of this patch series.
>
> Thanks for preparing the v2! I spent some time looking at this today,
> and I really like how it has been rearranged into an LSM. This feels
> much more natural.

Thank you very much for reviewing this work.

> Various notes:
>
> The locking isn't right; it'll trip with CONFIG_PROVE_LOCKING=y.
> Here's the giant splat:
>
> [    8.205146] brute: Fork brute force attack detected
> [    8.206821]
> [    8.207317] =====================================================
> [    8.209392] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> [    8.211852] 5.10.0-rc3 #2 Not tainted
> [    8.213215] -----------------------------------------------------
> [...]

Interesting... My ".config" also has this option enabled, but this warning
never appeared. Under what circumstances did it happen? I tested this
patchset with two custom test programs without any problem. Maybe my tests
are not good enough :(

> I think it should be possible to use existing task locking semantics
> to manage the statistics, but I'll need to take a closer look.
>
> > Other implementations
> > ---------------------
> > [...]
>
> It's good to clarify what the expected behaviors should be; however,
> while working with the resulting system, it wasn't clear what the threat
> model was for this defense. I think we need two things: clear
> descriptions of what is expected to be detected (and what is not), and a
> set of self-tests that can be used to validate those expectations.

Ok, I will make the necessary changes to clarify these points. I will
also add the self-tests. Thanks for the suggestion.

> Also, what, specifically, does "fatal error" cover? Is it strictly fatal
> signals? (i.e. "error" might refer to exit code, for example.)

Ok, understood.

> >
> > This implementation
> > -------------------
> > [...]
> >
> > So, the solution to the two bypass methods described above is to use
> > statistical data shared across all the processes that can have the
> > same memory contents; in other words, statistical data shared among
> > all the processes in a fork hierarchy after an execve system call.
>
> Hm, is this already tracked in some other way? i.e. the family hierarchy
> of the mm struct? They're only shared on clone, but get totally copied on
> fork(). Should that be the place where this is tracked instead? (i.e. I
> could fork and totally rearrange my VMAs.)

Great, I will give it a try and study this hierarchy. If this is feasible,
would the correct path be to create a new security blob held by the mm
struct? Also, I think new LSM hooks will be needed to notify the
allocation, release and copy of this structure. What do you think, am I
heading in the right direction?

> > The purpose of these statistics is to compute the application crash
> > period in order to detect an attack. [...] To avoid false positives,
> > the shared statistical data holds a list of the timestamps of the
> > last i crashes, and the application crash period is computed as
> > follows:
> >
> > crash_period = (n_last_timestamp - n_minus_i_timestamp) / i;
> >
> > This way, the size of the crash timestamps list allows fine-tuning of
> > the detection sensitivity.
>
> Instead of a list, can't the rate just be calculated on an on-going
> basis?

The first version computed the application crash period on the fly
(without a list) using the accumulated time and the number of faults.

period = time_since_execve / faults_since_execve;

But this method has two major drawbacks. The first (noted by Jann Horn)
is that if an application runs for a long time (say, a month) without any
crash, and the detection threshold is set to 30 s (the default value), an
attacker can break the app approximately:

30*24*60*60/30 = 86400 times

before the detection is triggered.

The second drawback is that if the first fault happens quickly (right
after the execve system call) the detection is triggered instantly. This
is a problem if an application fails quickly for reasons unrelated to a
real attack.

I think that to avoid the first drawback we can compute the crash period
using a kind of weighted average (giving more weight to the most recent
times between faults), as sketched below. But the second drawback does
not go away.
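
Something like this, perhaps (an untested sketch; the 1/4 weighting is
arbitrary):

    /*
     * Exponentially weighted average of the time between faults: each
     * new interval contributes 1/4 and older history decays
     * geometrically, so recent behaviour dominates. The very first
     * fault still has no history, so the second drawback would remain.
     */
    static u64 brute_update_period(u64 avg, u64 interval)
    {
            if (!avg)
                    return interval;

            return (3 * avg + interval) / 4;
    }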

However, with the list we have neither of these drawbacks (only a larger
memory footprint and a bit more computation).

Any ideas to remove the list are welcome ;)

> > When this crash period falls below a certain threshold, there is a
> > clear signal that something malicious is happening. Once an attack is
> > detected, the mitigation kills only the processes that share the same
> > statistical data, and so all the tasks that can have the same memory
> > contents. This way, the attack is stopped.
>
> Here's where I think the threat model needs some more work. The above
> describes what I think is a less common situation. I expect the common
> attack to hold still with a single value, and let the fork/exec spin
> until the value lines up. (i.e. a fork is required.)
>
> Here are the threat scenarios that come to mind for me:
>
> 1- launching (fork/exec) a setuid process repeatedly until you get a
>    desirable memory layout (e.g. what Stack Clash[1] did).
>
> 2- connecting to an exec()ing network daemon (e.g. xinetd) repeatedly
>    until you get a desirable memory layout (e.g. what CTFs do for simple
>    network service[2]).
>
> 3- launching processes _without_ exec (e.g. Android Zygote[3]), and
>    exposing state to attack a sibling.
>
> 4- connecting to a fork()ing network daemon (e.g. apache) repeatedly
>    until you expose the previously-shared memory layout of all the other
>    children (e.g. kind of related to HeartBleed[4], though that was a
>    direct exposure not a crasher).
>
> In each case, a privilege boundary is being crossed (setuid in the first,
> priv-changes in the 2nd, and network-to-local in the latter two), so I
> suspect that kind of detail will need to play a part in the design to
> help avoid false positives.

Ok, I will try to play with privilege boundary crossing in the next
version.

> Regardless, when I tested this series, scenarios 1 and 3 aren't detected, since
> they pass through an execve(), and I think that needs to be covered as
> well.

Thanks, I will work on it and review the execve case. Also, thanks for
showing me real examples of fork brute force attacks ;)

> [1] https://www.qualys.com/2017/06/19/stack-clash/stack-clash.txt
> [2] https://github.com/BSidesPDX/CTF-2017/blob/master/pwn/200-leek/src/leek.service
> [3] https://link.springer.com/article/10.1007/s10207-018-00425-8
> [4] https://heartbleed.com/
>
> >
> > 1.- Per-system enabling: This feature can be enabled at build time
> >     using the CONFIG_SECURITY_FORK_BRUTE option or via the kernel
> >     configuration menu:
> >
> >     Security options ---> Fork brute force attack detection and mitigation
>
> (there is a built-in boot-time disable too, by changing the lsm=
> bootparam)

Thanks, I will add it to the documentation.

> > 2.- Per-process enabling/disabling: To allow specific applications to
> >     turn the detection and mitigation of a fork brute force attack on
> >     or off when required, there are two new prctls.
> >
> >     prctl(PR_SECURITY_FORK_BRUTE_ENABLE, 0, 0, 0, 0)
> >     prctl(PR_SECURITY_FORK_BRUTE_DISABLE, 0, 0, 0, 0)
>
> How do you see this being used?

Maybe the PR_SECURITY_FORK_BRUTE_DISABLE prctl can be used by applications
like "runc" (discussed below) if we cannot exclude this false positive; I
mean, during phases that need to fork and kill quickly. Then, during other
phases, the protection could be enabled again. Does that make sense?

> > 3.- Fine tuning: To customize the detection sensitivity there are two
> >     new sysctl attributes that set the crash timestamps list size and
> >     the application crash period threshold (in milliseconds). [...]
>
> I wonder if these will be needed as we narrow in on the specific threat
> model (i.e. there will be enough additional signal to obviate this
> tuning).

I agree. I will remove as much configuration as possible. Thanks.

> And in testing I found another false positive that I haven't fully
> diagnosed. I found that at boot, with Docker installed, "runc" would
> immediately trip the mitigation. With some debugging added, it looks like
> runc had several forked processes that got SIGKILLed in quick succession,
> and then the entire group got killed by Brute. I haven't narrowed down
> what runc is doing here, but it makes me wonder if there might need
> to be an exception for user-space delivered signals, as opposed to
> kernel-delivered signals...

I will try to work on this as well.

> Thanks again for the work! I'm liking the idea of getting a solid
> protection for this. It's been a long-standing hole in upstream. :)

It's a pleasure to work on this protection. This is the perfect task to
learn a lot.

Thank you very much for the review and all the comments. I know this
review was streamed on Twitch but, since my main language is not English,
I cannot follow speech the way I would like. Will the video be uploaded
to YouTube? There I can turn on the subtitles, learn, and improve more
things. Thanks again.

>
> -Kees
>
> --
> Kees Cook

Regards,
John Wood