Date: Wed, 8 May 2024 00:42:57 +0800
From: HexRabbit Chen <hexrabbit@...co.re>
To: oss-security@...ts.openwall.com
Subject: CVE-2024-26925: Linux: nf_tables: locking issue in the
 nf_tables_abort() function

Hello,

I found a locking issue in the nf_tables set element GC implementation
and exploited it in kernelCTF. The bug breaks the sequence-number
assumption of the asynchronous set GC, which can be abused to cause a
double free, leading to local privilege escalation.

Introduced in v6.5:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=720344340fb9

Fixed in v6.9-rc3:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0d459e2ffb54

========================================================================
Vulnerability Details
========================================================================

The asynchronous set GC (nft_rhash_gc, for example) does not acquire
the commit lock while doing its work; instead, it relies on a GC
sequence (gc_seq) mechanism to protect itself from racing with
transactions.

At the beginning of nft_rhash_gc(), it saves the current GC
sequence[1] and allocates a GC transaction to store information, then
traverses the set to record all expired set elements into the GC
transaction[2], and finally calls nft_trans_gc_queue_async_done(gc)[3].

------------------------------------------------------------------------
static void nft_rhash_gc(struct work_struct *work)
{
    // ...
    gc_seq = READ_ONCE(nft_net->gc_seq); // [1]

    if (nft_set_gc_is_pending(set))
        goto done;

    gc = nft_trans_gc_alloc(set, gc_seq, GFP_KERNEL);
    if (!gc)
        goto done;

    // ...
    while ((he = rhashtable_walk_next(&hti))) {
        // check if setelem expired
        if (!nft_set_elem_expired(&he->ext))
            continue;

        // ...
        nft_trans_gc_elem_add(gc, he); // [2]
    }

    if (gc)
        nft_trans_gc_queue_async_done(gc); // [3]

    // ...
}
------------------------------------------------------------------------

The function nft_trans_gc_queue_async_done() adds the GC transaction
to a global list and eventually schedules nft_trans_gc_work() to run.
nft_trans_gc_work() then retrieves the GC transaction and calls
nft_trans_gc_work_done()[4] to check the GC sequence; if the check
passes, the transaction and the elements it holds are freed via
call_rcu()[5].

------------------------------------------------------------------------
static void nft_trans_gc_work(struct work_struct *work)
{
    // ...
    list_for_each_entry_safe(trans, next, &trans_gc_list, list) {
        list_del(&trans->list);
        if (!nft_trans_gc_work_done(trans)) { // [4]
            nft_trans_gc_destroy(trans);
            continue;
        }
        call_rcu(&trans->rcu, nft_trans_gc_trans_free); // [5]
    }
}
------------------------------------------------------------------------

The function nft_trans_gc_work_done() first acquires the commit
lock[6] and compares the saved GC sequence with the current one[7].
Every critical section that modifies the control plane is surrounded
by nft_gc_seq_begin() and nft_gc_seq_end(), both of which increment
the current GC sequence (nft_net->gc_seq). So if the two values
differ, the GC worker raced with a transaction and the state of the
set may have changed; the function returns false to stop processing
this GC transaction.

------------------------------------------------------------------------
static bool nft_trans_gc_work_done(struct nft_trans_gc *trans)
{
    struct nftables_pernet *nft_net;
    struct nft_ctx ctx = {};

    nft_net = nft_pernet(trans->net);

    mutex_lock(&nft_net->commit_mutex); // [6]

    /* Check for race with transaction, otherwise this batch refers to
     * stale objects that might not be there anymore. Skip transaction if
     * set has been destroyed from control plane transaction in case gc
     * worker loses race.
     */
    if (READ_ONCE(nft_net->gc_seq) != trans->seq || trans->set->dead) { // [7]
        mutex_unlock(&nft_net->commit_mutex);
        return false;
    }

    ctx.net = trans->net;
    ctx.table = trans->set->table;

    nft_trans_gc_setelem_remove(&ctx, trans);
    mutex_unlock(&nft_net->commit_mutex);

    return true;
}
------------------------------------------------------------------------

However, the GC sequence mechanism only works under the assumption
that the commit lock is never released inside the critical section
between nft_gc_seq_begin() and nft_gc_seq_end(). Otherwise, a GC
worker can record an expired object and acquire the released commit
lock within the same gc_seq window, thus bypassing the GC sequence
check.

__nf_tables_abort() is the one that gets this wrong. The function is
surrounded by nft_gc_seq_begin() and nft_gc_seq_end()[8], and if it
receives the action NFNL_ABORT_AUTOLOAD, nf_tables_module_autoload()[9]
is called to process the pending module requests. However, that
function releases the commit lock[10] before processing the requests,
which breaks the GC sequence assumption. As a result, a stale set
element object can pass the check, leading to a double free.

------------------------------------------------------------------------
static int nf_tables_abort(struct net *net, struct sk_buff *skb,
               enum nfnl_abort_action action)
{
    gc_seq = nft_gc_seq_begin(nft_net); // [8]
    ret = __nf_tables_abort(net, action);
    nft_gc_seq_end(nft_net, gc_seq); // [8]
    mutex_unlock(&nft_net->commit_mutex);

    return ret;
}

static int __nf_tables_abort(struct net *net, enum nfnl_abort_action action)
{
    // ...

    if (action == NFNL_ABORT_AUTOLOAD)
        nf_tables_module_autoload(net); // [9]
    else
        nf_tables_module_autoload_cleanup(net);

    return 0;
}

static void nf_tables_module_autoload(struct net *net)
{
    struct nftables_pernet *nft_net = nft_pernet(net);
    struct nft_module_request *req, *next;
    LIST_HEAD(module_list);

    list_splice_init(&nft_net->module_list, &module_list);
    mutex_unlock(&nft_net->commit_mutex); // [10]
    list_for_each_entry_safe(req, next, &module_list, list) {
        request_module("%s", req->module);
        req->done = true;
    }
    mutex_lock(&nft_net->commit_mutex);
    list_splice(&module_list, &nft_net->module_list);
}
------------------------------------------------------------------------

The fix for this vulnerability is simple: move the call to
nf_tables_module_autoload() after nft_gc_seq_end().

------------------------------------------------------------------------
static int nf_tables_abort(struct net *net, struct sk_buff *skb,
               enum nfnl_abort_action action)
{
    gc_seq = nft_gc_seq_begin(nft_net);
    ret = __nf_tables_abort(net, action);
    nft_gc_seq_end(nft_net, gc_seq);

    WARN_ON_ONCE(!list_empty(&nft_net->commit_list));

+   /* module autoload needs to happen after GC sequence update because it
+    * temporarily releases and grabs mutex again.
+    */
+   if (action == NFNL_ABORT_AUTOLOAD)
+       nf_tables_module_autoload(net);
+   else
+       nf_tables_module_autoload_cleanup(net);

    mutex_unlock(&nft_net->commit_mutex);

    return ret;
}
------------------------------------------------------------------------

========================================================================
Exploit Disclosure
========================================================================
I will publish the exploit for this vulnerability on Google's
security-research GitHub repo 30 days after the CVE disclosure
(approximately on 5/24), in accordance with Google's disclosure
policy.

========================================================================
Discoverer
========================================================================
Kuan-Ting Chen (@h3xr4bb1t) of DEVCORE Research Team

Best,
Kuan-Ting Chen
