Vulnerability
StackRot (CVE-2023-3269) is a privilege escalation vulnerability in the Linux kernel. This disclosure complies with the linux-distros list policy and aims to provide early information about the vulnerability. While the essential details of the vu
lnerability are included here, a complete exploit code and comprehensive write-up will be publicly available by the end of July. The GitHub repository will be updated, and the oss-security thread will be notified accordingly.
The vulnerability, known as StackRot or "Stack Rot," affects the handling of stack expansion in Linux kernel versions 6.1 through 6.4. The issue arises in the maple tree, responsible for managing virtual memory areas. During node replacement, the MM write lock is not properly acquired, resulting in use-after-free problems. An unprivileged local user can exploit this flaw to compromise the kernel and elevate their privileges.
StackRot impacts nearly all kernel configurations as it is a vulnerability within the memory management subsystem of the Linux kernel. Triggering the vulnerability requires minimal capabilities. However, it is important to note that maple nodes are freed using RCU callbacks, which delay memory deallocation until after the RCU grace period. Consequently, exploiting this vulnerability is considered challenging.
To the best of current knowledge, there are no publicly available exploits targeting use-after-free-by-RCU (UAFBR) bugs. This is the first known instance where UAFBR bugs have been proven to be exploitable, even without the presence of CONFIG_PREEMPT or CONFIG_SLAB_MERGE_DEFAULT settings.
Note: The StackRot vulnerability has existed in the Linux kernel since version 6.1 when the VMA tree structure transitioned from red-black trees to maple trees.
Maple Tree
The maple tree is a range-based B-tree designed to utilize modern processor cache efficiently while ensuring RCU (Read-Copy-Update) safety. Its implementation offers various benefits in the Linux kernel, particularly in areas where a non-overlapping range-based tree with a straightforward interface would be advantageous. If you currently use an rbtree in conjunction with other data structures to enhance performance or an interval tree to track non-overlapping ranges, the maple tree is a suitable alternative.
The tree exhibits a branching factor of 10 for non-leaf nodes and 16 for leaf nodes. Compared to the rbtree, the maple tree has a significantly shorter height, resulting in fewer cache misses. Furthermore, the elimination of the linked list between consecutive entries reduces cache misses and the necessity to fetch the previous and next VMA (Virtual Memory Area) during various tree modifications.
The initial focus of this patch set is to apply the maple tree as a replacement for three data structures in the vm_area_struct. These structures include the augmented rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The ultimate objective is to reduce or eliminate contention on the mmap_lock.
The plan is to transition using the maple tree in RCU mode, where readers will not block writers. Only one write operation will be permitted at a time, and if stale data is encountered, a reader will re-traverse the tree. RCU will be enabled for VMAs, and this mode will be activated once multiple tasks are utilizing the mm_struct.
Issue - StackRot (CVE-2023-3269)
When the mmap() system call is utilized to establish a memory mapping in the Linux kernel, a specialized data structure called vm_area_struct is generated to represent the corresponding virtual memory area (VMA). This VMA structure plays a crucial role as a container, housing a wide range of information such as flags, properties, and other pertinent details that are directly associated with the specific memory mapping operation.
struct vm_area_struct { long unsigned int vm_start; /* 0 8 */ long unsigned int vm_end; /* 8 8 */ struct mm_struct * vm_mm; /* 16 8 */ pgprot_t vm_page_prot; /* 24 8 */ long unsigned int vm_flags; /* 32 8 */ union { struct { struct rb_node rb __attribute__((__aligned__(8))); /* 40 24 */ /* --- cacheline 1 boundary (64 bytes) --- */ long unsigned int rb_subtree_last; /* 64 8 */ } __attribute__((__aligned__(8))) shared __attribute__((__aligned__(8))); /* 40 32 */ struct anon_vma_name * anon_name; /* 40 8 */ } __attribute__((__aligned__(8))); /* 40 32 */ /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ struct list_head anon_vma_chain; /* 72 16 */ struct anon_vma * anon_vma; /* 88 8 */ const struct vm_operations_struct * vm_ops; /* 96 8 */ long unsigned int vm_pgoff; /* 104 8 */ struct file * vm_file; /* 112 8 */ void * vm_private_data; /* 120 8 */ /* --- cacheline 2 boundary (128 bytes) --- */ atomic_long_t swap_readahead_info; /* 128 8 */ struct vm_userfaultfd_ctx vm_userfaultfd_ctx; /* 136 0 */ /* size: 136, cachelines: 3, members: 14 */ /* forced alignments: 1 */ /* last cacheline: 8 bytes */ } __attribute__((__aligned__(8)));
In order to efficiently handle page faults and other memory-related system calls, the Linux kernel requires rapid lookup of the virtual memory area (VMA) based solely on the address. Traditionally, VMA management relied on red-black trees. However, starting from Linux kernel version 6.1, the transition was made to maple trees. Maple trees are RCU-safe B-tree data structures specifically optimized for storing non-overlapping ranges. Conversely, the introduction of maple trees added complexity to the codebase and introduced the StackRot vulnerability.
The maple tree structure consists of maple nodes. For the purposes of this discussion, we will assume that the maple tree has a single root node, which can accommodate a maximum of 16 intervals. Each interval can represent either a gap or point to a specific VMA. Consequently, it is not possible to have gaps between two intervals within the tree.
struct maple_range_64 { struct maple_pnode * parent; /* 0 8 */ long unsigned int pivot[15]; /* 8 120 */ /* --- cacheline 2 boundary (128 bytes) --- */ union { void * slot[16]; /* 128 128 */ struct { void * pad[15]; /* 128 120 */ /* --- cacheline 3 boundary (192 bytes) was 56 bytes ago --- */ struct maple_metadata meta; /* 248 2 */ }; /* 128 128 */ }; /* 128 128 */ /* size: 256, cachelines: 4, members: 3 */ };
The data structure maple_range_64 is employed to represent a node within the maple tree implementation. This structure is defined as follows: it contains pivots, which denote the boundaries of 16 intervals, and slots, which serve as references to the VMA (Virtual Memory Area) structure when the node is considered a leaf node.
The maple tree introduces certain restrictions for concurrent modification, imposing the requirement of an exclusive lock for writers. In the case of the VMA tree, this exclusive lock corresponds to the MM write lock. As for readers, two options are available to them. The first option involves holding the MM read lock, which results in writers being blocked by the MM read-write lock. The second option is to enter the RCU critical section, allowing writers to proceed without being blocked, while readers can continue their operations due to the RCU-safe nature of the maple tree. While most existing VMA accesses typically choose the first option, the second option is employed in a few performance-critical scenarios, such as lockless page faults.
Though, there is an additional aspect that requires careful consideration, particularly regarding stack expansion. The stack represents a memory area that is mapped with the MAP_GROWSDOWN flag, indicating automatic expansion when an address below the region is accessed. In such cases, adjustments are made to the start address of the corresponding VMA and the associated interval within the maple tree. Notably, these adjustments are performed without holding the MM write lock.
static inline void do_user_addr_fault(struct pt_regs *regs, unsigned long error_code, unsigned long address) { // ... if (unlikely(!mmap_read_trylock(mm))) { // ... } // ... if (unlikely(expand_stack(vma, address))) { // ... } // ... }
In the Linux kernel, a stack guard is typically enforced to create a gap between the stack Virtual Memory Area (VMA) and its adjacent VMA. When expanding the stack, the pivot value in the maple node can be updated atomically, given the presence of this gap. However, if the neighboring VMA also has the MAP_GROWSDOWN flag, there is no stack guard enforced, and the stack expansion can eliminate the existing gap. In such cases, it becomes necessary to remove the interval within the maple node corresponding to the gap.
int expand_downwards(struct vm_area_struct *vma, unsigned long address) { // ... if (prev) { if (!(prev->vm_flags & VM_GROWSDOWN) && vma_is_accessible(prev) && (address - prev->vm_end < stack_guard_gap)) return -ENOMEM; } // ... }
Since the maple tree implementation in the kernel is RCU-safe, it is not possible to overwrite the existing node in-place. Instead, a new node is created, which triggers node replacement. Subsequently, the old node is destroyed using an RCU callback mechanism. This ensures the safe and efficient handling of node updates within the maple tree data structure.
The problem occurs with the invocation of the RCU (Read-Copy-Update) callback, which occurs only after all pre-existing RCU critical sections have completed. However, when accessing Virtual Memory Areas (VMAs), the MM (Memory Management) read lock is the only lock held, and it does not enter the RCU critical section. This creates a potential issue where the RCU callback could be invoked at any time, potentially leading to the freeing of the old maple node. However, pointers to the old node may have already been obtained, resulting in a use-after-free bug when subsequent access is attempted.
The following backtrace depicts the specific location where the use-after-free (UAF) occurs:
Remediation
On June 15th, I responsibly reported this vulnerability to the Linux kernel security team. Linus Torvalds led the process of addressing the bug, considering its complexity. It took nearly two weeks to develop a series of patches that received consensus and were deemed suitable for resolving the issue.
On June 28th, during the merge window for Linux kernel 6.5, the fix was successfully incorporated into Linus' tree. Linus provided a detailed merge message, offering technical insights into the patch series and its implementation.
Following the merge, these patches were backported to stable kernels, specifically versions 6.1.37, 6.3.11, and 6.4.1. This backporting process ensured that the "Stack Rot" bug was effectively addressed and resolved on July 1st.
References
Fix was merged into Linus' tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9471f1f2f50282b9e8f59198ec6bb738b4ccc009
The updated 6.1.y, 6.3.y and 6.4.y git trees can be found at:
Git:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.1.y
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.3.y
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.4.y
Credit
@lrh2000 - Ruihan Li
@sochotnicky - Stanislav Ochotnický
Comments