Linux kernel patch from the Openwall Project: README



Overview.

This is a security hardening patch for the Linux kernel. The patch includes changes that fall into three categories:

1. A collection of security-related features for the Linux kernel, all configurable via the new "Security options" configuration section. These are described in detail below.

2. Security hardening for the kernel itself (that is, against potential vulnerabilities in the kernel).

3. Additionally, some versions of the patch contain various security fixes for known vulnerabilities in the kernel. The number of such fixes varies from version to version, as some become obsolete (for example, because the same problem gets fixed in a new kernel release) while new security issues are discovered.

Non-executable user stack area.

Most buffer overflow exploits are based on overwriting a function's return address on the stack to point to some arbitrary code, which is also put onto the stack. If the stack area is non-executable, buffer overflow vulnerabilities become harder to exploit.

Another way to exploit a buffer overflow is to point the return address to a function in libc, usually system(). This patch also changes the default address at which shared libraries are mmap()'ed so that it always contains a zero byte. In the many exploits that deal with ASCIIZ (NUL-terminated) strings, this makes it impossible to supply any more data after the address (parameters to the function, or more copies of the return address when filling with a pattern), because the zero byte terminates the string.

However, note that this patch is by no means a complete solution; it just adds an extra layer of security. Many buffer overflow vulnerabilities will remain exploitable in a more complicated way, and some will even remain unaffected by the patch. The reason for using such a patch is to protect against some of the buffer overflow vulnerabilities that are not yet known.

Also, note that some buffer overflows can be used for denial of service attacks (usually in non-respawning daemons and network clients). A patch like this cannot do anything against that.

It is important that you fix vulnerabilities as soon as they become known, even if you're using the patch. The same applies to other features of the patch (discussed below) and their corresponding vulnerabilities.

Restricted access to VM86 mode.

On x86 processors, the Virtual 8086 (VM86) mode allows the execution of real mode operating systems and applications (primarily DOS) under protected mode operating systems such as Linux (with dosemu). This requires support from the kernel. Although the amount of kernel code needed to support the VM86 mode is small and no security problems with it are currently known, that code is unused on most Linux systems and as such it poses an unreasonable risk. This option restricts access to system calls used to enter the VM86 mode to processes that possess the CAP_SYS_RAWIO capability. The effect is that any potential security bugs in the VM86 mode support code are neutralized.

Restricted zero page mappings.

On modern operating systems such as Linux, the memory page at virtual address zero is typically not mapped, in order to trap NULL pointer dereference bugs in both user-space programs and the kernel. However, a malicious user-space program may map the zero page and then invoke a system call that is known to trigger a NULL pointer dereference in the kernel. Depending on the specific NULL pointer dereference bug, it may be possible to get the kernel to read from, write to, or execute code at an arbitrary kernel-space address, thereby completely compromising system security. Enabling this option introduces logging of failed attempts to map low pages and sets the vm.mmap_min_addr sysctl to 32768 by default, which restricts the ability to map the zero page (and the next few pages) to processes that possess the CAP_SYS_RAWIO capability. This should reduce the impact of most NULL pointer dereference bugs to no worse than denial of service. Of course, the value of vm.mmap_min_addr may be adjusted on the running system, including setting it to 0 to disable this hardening measure. This may be needed in some cases to run programs such as Wine and QEMU, although in many cases they will run just fine even with the hardening measure enabled.

Restricted links in /tmp.

I've also added a link-in-+t restriction, originally for Linux 2.0 only, by Andrew Tridgell, which prevents processes from following symlinks in +t (sticky) directories such as /tmp unless the link is owned by the user following it or by the directory's owner. I've extended it to also prevent hard links from being used in such an attack, by not allowing regular users to create hard links to files they don't own, unless they could read and write the file (due to group permissions). This is usually the desired behavior anyway, since otherwise users could create links in a +t directory that they then couldn't remove (unfortunately, this is still possible for group-writable files), and because of disk quotas.

Unfortunately, this may break existing applications.

Restricted FIFOs in /tmp.

In addition to restricting links, you might also want to restrict writes into untrusted FIFOs (named pipes), to make data spoofing attacks harder. Enabling this option disallows writing into FIFOs not owned by the user in +t directories, unless the owner is the same as that of the directory or the FIFO is opened without the O_CREAT flag.

Restricted /proc.

This was originally a patch by route that only changed the permissions on some directories in /proc, so that you had to be root to access them. Then there were similar patches by others. I found them all quite unusable for my purposes, on a system where I wanted several admins to be able to see all the processes, etc., without having to su to root (or use sudo) each time. So I created my own patch, which is included here.

This option restricts the permissions on /proc so that non-root users can see their own processes only, and nothing about active network connections, unless they're in a special group. This group's id is specified via the gid= mount option, and is 0 by default. (Note: if you're using identd, you will need to edit the inetd.conf line to run identd as this special group.) Also, this disables dmesg(8) for the users. You might want to use this on an ISP shell server where privacy is an issue. Note that these extra restrictions can be trivially bypassed with physical access (without having to reboot).

When using this part of the patch, most programs (ps, top, who) work as desired: they only show the user's own processes (unless the user is root, is in the special group, or is running with the relevant capabilities on 2.2+), and they don't complain about being unable to access the others. However, there's a known problem with w(1) in recent versions of procps, so if you use it, you should apply the included patch to procps.

Special handling of fd 0, 1, and 2 (Linux 2.0 and 2.2 only).

File descriptors 0, 1, and 2 have a special meaning for the C library and lots of programs. Thus, they're often referenced by number. Still, it is normally possible to execute a program with one or more of these fd's closed, and any open(2) calls it might do will happily provide these fd numbers. The program (or the libraries it is linked with) will continue using the fd's for their usual purposes, in reality accessing files the program has just opened. If such a program is installed SUID and/or SGID, then we might have a security problem.

Enable this option to ensure that fd's 0, 1, and 2 are always open on startup of a SUID/SGID binary. If any of the fd's is closed, "/dev/null" will be opened for it (the device itself; you don't need to have /dev in the filesystem for that to work, such as in a chroot). This part of the patch is by Pavel Kankovsky; I've only ported it to Linux 2.2 (any errors are mine, of course).

Enforce RLIMIT_NPROC on execve(2).

Linux lets you set a limit on how many processes a user can have, via a setrlimit(2) call with RLIMIT_NPROC. Unfortunately, this limit is only looked at when a new process is created on fork(2). If a process changes its UID, it might exceed the limit for its new UID.

This is not a security issue by itself, as changing the UID is a privileged operation. However, there are privileged programs that want to switch to a user's context, including setting up some resource limits. The only fork(2) such a program needs (if any) is done before it switches the UID, and thus isn't checked against the RLIMIT_NPROC of the new UID.

Enable this option to enforce RLIMIT_NPROC on execve(2) calls. (The Linux 2.0 version of this patch only checks the limit for processes that have their "dumpable" flag reset, such as due to a UID change, to reduce the performance impact.)

Note that there's at least one good reason I am not enforcing the limit right after setuid(2) calls: some programs don't expect setuid(2) to fail when running as root.

Destroy shared memory segments not in use.

Linux lets you set resource limits, including on how much memory a process can consume, via setrlimit(2). Unfortunately, shared memory segments are allowed to exist without association with any process, and thus might not be counted against any resource limits.

This option automatically destroys shared memory segments when their attach count becomes zero after a detach or a process termination. It will also destroy segments that were created, but never attached to, on exit from the process. (In case you're curious, the only use left for IPC_RMID is to immediately destroy an unattached segment.)

Of course, this breaks the standard System V shared memory semantics, so some applications might stop working. In particular, expect most commercial databases to break. Apache and PostgreSQL are known to work, though. :-)

Note that this feature will do you no good unless you also configure your resource limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't need this feature.

Privileged IP aliases (Linux 2.0 only).

It is sometimes desirable not to let regular users put their services on some of the IP addresses configured on the system. For example, this is the case when providing web hosting services with shell and/or CGI access, so that one user can't abuse the other domains hosted on the same system.

When this option is enabled, only root can bind sockets to addresses of privileged aliased interfaces: those whose slot numbers fall in the first half of the allowed range. The default limit is also expanded to 2048 aliases, so that the familiar slot numbers 0 to 1023 become privileged.

How to install.

Make sure you have the original kernel sources (as can be obtained from ftp.kernel.org) installed in /usr/src/linux. Apply the patch:

	cd /usr/src/linux
	patch -p1 < PATCH-FILE

where PATCH-FILE is the full path and name of the linux-*-ow*.diff file.

In the kernel configuration, go to the new "Security options" section. Read the help text for the sub-options and configure them.

If desired, edit /etc/fstab to specify the group id for accessing /proc. Also, make sure you have no extra procfs mount commands in the startup scripts, as these might override your fstab settings; this is the case for some distributions, including Red Hat. (Note that you won't be able to specify the GID by remounting /proc on a running system, because filesystem-specific options are not supported when remounting.)

Build the kernel and reboot.

You may also want to add the following line to your /etc/syslog.conf to log security alerts to a separate file:

	kern.alert				/var/log/alert

Additionally, you may do something like this (assuming the log file will be empty most of the time):

	> /var/log/alert
	chown root:staff /var/log/alert
	chmod 640 /var/log/alert
	echo "less -XEU /var/log/alert" >> ~non-root/.bash_profile

Ensure that the non-executable stack part of the patch is working, using stacktest.c for that purpose: running "./stacktest -e" should segfault, and a message should get logged to /var/log/alert (if you've followed the syslogd configuration described above). If you've enabled support for GCC trampolines, try running "./stacktest -t"; it should succeed. If you have trampoline call emulation enabled on Linux 2.0, you should also try "./stacktest -b": the simulated exploit attempt should fail even after a trampoline call in the same process has succeeded.

If you enabled the link-in-+t restriction, you can also try to create a symlink in /tmp (as a non-root user) pointing to a file that user has no read access to, then switch to some other user that does have read access (for example, root) and try to read the file via the link (for example, with "cat /tmp/link"). This should fail, and a message should get logged.

Next, as a non-root user, try to create a hard link to a file that user doesn't own and can't both read and write. This should also fail.

Be sure to check out the FAQ.

--
Solar Designer <solar at openwall.com>