| CVE |
Vendors |
Products |
Updated |
CVSS v3.1 |
| In the Linux kernel, the following vulnerability has been resolved:
net: mdio: Check regmap pointer returned by device_node_to_regmap()
The call to device_node_to_regmap() in airoha_mdio_probe() can return
an ERR_PTR() if regmap initialization fails. Currently, the driver
stores the pointer without validation, which could lead to a crash
if it is later dereferenced.
Add an IS_ERR() check and return the corresponding error code to make
the probe path more robust. |
| In the Linux kernel, the following vulnerability has been resolved:
drm/mediatek: Disable AFBC support on Mediatek DRM driver
Commit c410fa9b07c3 ("drm/mediatek: Add AFBC support to Mediatek DRM
driver") added AFBC support to Mediatek DRM and enabled the
32x8/split/sparse modifier.
However, this is currently broken on Mediatek MT8188 (Genio 700 EVK
platform); tested using upstream Kernel and Mesa (v25.2.1), AFBC is used by
default since Mesa v25.0.
Kernel trace reports vblank timeouts constantly, and the render is garbled:
```
[CRTC:62:crtc-0] vblank wait timed out
WARNING: CPU: 7 PID: 70 at drivers/gpu/drm/drm_atomic_helper.c:1835 drm_atomic_helper_wait_for_vblanks.part.0+0x24c/0x27c
[...]
Hardware name: MediaTek Genio-700 EVK (DT)
Workqueue: events_unbound commit_work
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : drm_atomic_helper_wait_for_vblanks.part.0+0x24c/0x27c
lr : drm_atomic_helper_wait_for_vblanks.part.0+0x24c/0x27c
sp : ffff80008337bca0
x29: ffff80008337bcd0 x28: 0000000000000061 x27: 0000000000000000
x26: 0000000000000001 x25: 0000000000000000 x24: ffff0000c9dcc000
x23: 0000000000000001 x22: 0000000000000000 x21: ffff0000c66f2f80
x20: ffff0000c0d7d880 x19: 0000000000000000 x18: 000000000000000a
x17: 000000040044ffff x16: 005000f2b5503510 x15: 0000000000000000
x14: 0000000000000000 x13: 74756f2064656d69 x12: 742074696177206b
x11: 0000000000000058 x10: 0000000000000018 x9 : ffff800082396a70
x8 : 0000000000057fa8 x7 : 0000000000000cce x6 : ffff8000823eea70
x5 : ffff0001fef5f408 x4 : ffff80017ccee000 x3 : ffff0000c12cb480
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000c12cb480
Call trace:
drm_atomic_helper_wait_for_vblanks.part.0+0x24c/0x27c (P)
drm_atomic_helper_commit_tail_rpm+0x64/0x80
commit_tail+0xa4/0x1a4
commit_work+0x14/0x20
process_one_work+0x150/0x290
worker_thread+0x2d0/0x3ec
kthread+0x12c/0x210
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
```
Until this gets fixed upstream, disable AFBC support on this platform, as
it's currently broken with upstream Mesa. |
| In the Linux kernel, the following vulnerability has been resolved:
wifi: iwlwifi: fix potential use after free in iwl_mld_remove_link()
This code frees "link" by calling kfree_rcu(link, rcu_head) and then it
dereferences "link" to get the "link->fw_id". Save the "link->fw_id"
first to avoid a potential use after free. |
| In the Linux kernel, the following vulnerability has been resolved:
s390: Disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
As reported by Luiz Capitulino enabling HVO on s390 leads to reproducible
crashes. The problem is that kernel page tables are modified without
flushing corresponding TLB entries.
Even if it looks like the empty flush_tlb_all() implementation on s390 is
the problem, it is actually a different problem: on s390 it is not allowed
to replace an active/valid page table entry with another valid page table
entry without the detour over an invalid entry. A direct replacement may
lead to random crashes and/or data corruption.
In order to invalidate an entry special instructions have to be used
(e.g. ipte or idte). Alternatively there are also special instructions
available which allow to replace a valid entry with a different valid
entry (e.g. crdte or cspg).
Given that the HVO code currently does not provide the hooks to allow for
an implementation which is compliant with the s390 architecture
requirements, disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP again, which is
basically a revert of the original patch which enabled it. |
| In the Linux kernel, the following vulnerability has been resolved:
netpoll: Fix deadlock in memory allocation under spinlock
Fix a AA deadlock in refill_skbs() where memory allocation while holding
skb_pool->lock can trigger a recursive lock acquisition attempt.
The deadlock scenario occurs when the system is under severe memory
pressure:
1. refill_skbs() acquires skb_pool->lock (spinlock)
2. alloc_skb() is called while holding the lock
3. Memory allocator fails and calls slab_out_of_memory()
4. This triggers printk() for the OOM warning
5. The console output path calls netpoll_send_udp()
6. netpoll_send_udp() attempts to acquire the same skb_pool->lock
7. Deadlock: the lock is already held by the same CPU
Call stack:
refill_skbs()
spin_lock_irqsave(&skb_pool->lock) <- lock acquired
__alloc_skb()
kmem_cache_alloc_node_noprof()
slab_out_of_memory()
printk()
console_flush_all()
netpoll_send_udp()
skb_dequeue()
spin_lock_irqsave(&skb_pool->lock) <- deadlock attempt
This bug was exposed by commit 248f6571fd4c51 ("netpoll: Optimize skb
refilling on critical path") which removed refill_skbs() from the
critical path (where nested printk was being deferred), letting nested
printk being called from inside refill_skbs()
Refactor refill_skbs() to never allocate memory while holding
the spinlock.
Another possible solution to fix this problem is protecting the
refill_skbs() from nested printks, basically calling
printk_deferred_{enter,exit}() in refill_skbs(), then, any nested
pr_warn() would be deferred.
I prefer this approach, given I _think_ it might be a good idea to move
the alloc_skb() from GFP_ATOMIC to GFP_KERNEL in the future, so, having
the alloc_skb() outside of the lock will be necessary step.
There is a possible TOCTOU issue when checking for the pool length, and
queueing the new allocated skb, but, this is not an issue, given that
an extra SKB in the pool is harmless and it will be eventually used. |
| In the Linux kernel, the following vulnerability has been resolved:
gpiolib: fix invalid pointer access in debugfs
If the memory allocation in gpiolib_seq_start() fails, the s->private
field remains uninitialized and is later dereferenced without checking
in gpiolib_seq_stop(). Initialize s->private to NULL before calling
kzalloc() and check it before dereferencing it. |
| In the Linux kernel, the following vulnerability has been resolved:
perf/x86/intel: Fix KASAN global-out-of-bounds warning
When running "perf mem record" command on CWF, the below KASAN
global-out-of-bounds warning is seen.
==================================================================
BUG: KASAN: global-out-of-bounds in cmt_latency_data+0x176/0x1b0
Read of size 4 at addr ffffffffb721d000 by task dtlb/9850
Call Trace:
kasan_report+0xb8/0xf0
cmt_latency_data+0x176/0x1b0
setup_arch_pebs_sample_data+0xf49/0x2560
intel_pmu_drain_arch_pebs+0x577/0xb00
handle_pmi_common+0x6c4/0xc80
The issue is caused by below code in __grt_latency_data(). The code
tries to access x86_hybrid_pmu structure which doesn't exist on
non-hybrid platform like CWF.
WARN_ON_ONCE(hybrid_pmu(event->pmu)->pmu_type == hybrid_big)
So add is_hybrid() check before calling this WARN_ON_ONCE to fix the
global-out-of-bounds access issue. |
| In the Linux kernel, the following vulnerability has been resolved:
sysfs: check visibility before changing group attribute ownership
Since commit 0c17270f9b92 ("net: sysfs: Implement is_visible for
phys_(port_id, port_name, switch_id)"), __dev_change_net_namespace() can
hit WARN_ON() when trying to change owner of a file that isn't visible.
See the trace below:
WARNING: CPU: 6 PID: 2938 at net/core/dev.c:12410 __dev_change_net_namespace+0xb89/0xc30
CPU: 6 UID: 0 PID: 2938 Comm: incusd Not tainted 6.17.1-1-mainline #1 PREEMPT(full) 4b783b4a638669fb644857f484487d17cb45ed1f
Hardware name: Framework Laptop 13 (AMD Ryzen 7040Series)/FRANMDCP07, BIOS 03.07 02/19/2025
RIP: 0010:__dev_change_net_namespace+0xb89/0xc30
[...]
Call Trace:
<TASK>
? if6_seq_show+0x30/0x50
do_setlink.isra.0+0xc7/0x1270
? __nla_validate_parse+0x5c/0xcc0
? security_capable+0x94/0x1a0
rtnl_newlink+0x858/0xc20
? update_curr+0x8e/0x1c0
? update_entity_lag+0x71/0x80
? sched_balance_newidle+0x358/0x450
? psi_task_switch+0x113/0x2a0
? __pfx_rtnl_newlink+0x10/0x10
rtnetlink_rcv_msg+0x346/0x3e0
? sched_clock+0x10/0x30
? __pfx_rtnetlink_rcv_msg+0x10/0x10
netlink_rcv_skb+0x59/0x110
netlink_unicast+0x285/0x3c0
? __alloc_skb+0xdb/0x1a0
netlink_sendmsg+0x20d/0x430
____sys_sendmsg+0x39f/0x3d0
? import_iovec+0x2f/0x40
___sys_sendmsg+0x99/0xe0
__sys_sendmsg+0x8a/0xf0
do_syscall_64+0x81/0x970
? __sys_bind+0xe3/0x110
? syscall_exit_work+0x143/0x1b0
? do_syscall_64+0x244/0x970
? sock_alloc_file+0x63/0xc0
? syscall_exit_work+0x143/0x1b0
? do_syscall_64+0x244/0x970
? alloc_fd+0x12e/0x190
? put_unused_fd+0x2a/0x70
? do_sys_openat2+0xa2/0xe0
? syscall_exit_work+0x143/0x1b0
? do_syscall_64+0x244/0x970
? exc_page_fault+0x7e/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
[...]
</TASK>
Fix this by checking is_visible() before trying to touch the attribute. |
| In the Linux kernel, the following vulnerability has been resolved:
net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for striding RQ
XDP programs can change the layout of an xdp_buff through
bpf_xdp_adjust_tail() and bpf_xdp_adjust_head(). Therefore, the driver
cannot assume the size of the linear data area nor fragments. Fix the
bug in mlx5 by generating skb according to xdp_buff after XDP programs
run.
Currently, when handling multi-buf XDP, the mlx5 driver assumes the
layout of an xdp_buff to be unchanged. That is, the linear data area
continues to be empty and fragments remain the same. This may cause
the driver to generate erroneous skb or triggering a kernel
warning. When an XDP program added linear data through
bpf_xdp_adjust_head(), the linear data will be ignored as
mlx5e_build_linear_skb() builds an skb without linear data and then
pull data from fragments to fill the linear data area. When an XDP
program has shrunk the non-linear data through bpf_xdp_adjust_tail(),
the delta passed to __pskb_pull_tail() may exceed the actual nonlinear
data size and trigger the BUG_ON in it.
To fix the issue, first record the original number of fragments. If the
number of fragments changes after the XDP program runs, rewind the end
fragment pointer by the difference and recalculate the truesize. Then,
build the skb with the linear data area matching the xdp_buff. Finally,
only pull data in if there is non-linear data and fill the linear part
up to 256 bytes. |
| In the Linux kernel, the following vulnerability has been resolved:
platform/mellanox: mlxbf-pmc: add sysfs_attr_init() to count_clock init
The lock-related debug logic (CONFIG_LOCK_STAT) in the kernel is noting
the following warning when the BlueField-3 SOC is booted:
BUG: key ffff00008a3402a8 has not been registered!
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 4 PID: 592 at kernel/locking/lockdep.c:4801 lockdep_init_map_type+0x1d4/0x2a0
<snip>
Call trace:
lockdep_init_map_type+0x1d4/0x2a0
__kernfs_create_file+0x84/0x140
sysfs_add_file_mode_ns+0xcc/0x1cc
internal_create_group+0x110/0x3d4
internal_create_groups.part.0+0x54/0xcc
sysfs_create_groups+0x24/0x40
device_add+0x6e8/0x93c
device_register+0x28/0x40
__hwmon_device_register+0x4b0/0x8a0
devm_hwmon_device_register_with_groups+0x7c/0xe0
mlxbf_pmc_probe+0x1e8/0x3e0 [mlxbf_pmc]
platform_probe+0x70/0x110
The mlxbf_pmc driver must call sysfs_attr_init() during the
initialization of the "count_clock" data structure to avoid
this warning. |
| In the Linux kernel, the following vulnerability has been resolved:
arch_topology: Fix incorrect error check in topology_parse_cpu_capacity()
Fix incorrect use of PTR_ERR_OR_ZERO() in topology_parse_cpu_capacity()
which causes the code to proceed with NULL clock pointers. The current
logic uses !PTR_ERR_OR_ZERO(cpu_clk) which evaluates to true for both
valid pointers and NULL, leading to potential NULL pointer dereference
in clk_get_rate().
Per include/linux/err.h documentation, PTR_ERR_OR_ZERO(ptr) returns:
"The error code within @ptr if it is an error pointer; 0 otherwise."
This means PTR_ERR_OR_ZERO() returns 0 for both valid pointers AND NULL
pointers. Therefore !PTR_ERR_OR_ZERO(cpu_clk) evaluates to true (proceed)
when cpu_clk is either valid or NULL, causing clk_get_rate(NULL) to be
called when of_clk_get() returns NULL.
Replace with !IS_ERR_OR_NULL(cpu_clk) which only proceeds for valid
pointers, preventing potential NULL pointer dereference in clk_get_rate(). |
| In the Linux kernel, the following vulnerability has been resolved:
riscv: stacktrace: Disable KASAN checks for non-current tasks
Unwinding the stack of a task other than current, KASAN would report
"BUG: KASAN: out-of-bounds in walk_stackframe+0x41c/0x460"
There is a same issue on x86 and has been resolved by the commit
84936118bdf3 ("x86/unwind: Disable KASAN checks for non-current tasks")
The solution could be applied to RISC-V too.
This patch also can solve the issue:
https://seclists.org/oss-sec/2025/q4/23
[pjw@kernel.org: clean up checkpatch issues] |
| In the Linux kernel, the following vulnerability has been resolved:
mlx5: Fix default values in create CQ
Currently, CQs without a completion function are assigned the
mlx5_add_cq_to_tasklet function by default. This is problematic since
only user CQs created through the mlx5_ib driver are intended to use
this function.
Additionally, all CQs that will use doorbells instead of polling for
completions must call mlx5_cq_arm. However, the default CQ creation flow
leaves a valid value in the CQ's arm_db field, allowing FW to send
interrupts to polling-only CQs in certain corner cases.
These two factors would allow a polling-only kernel CQ to be triggered
by an EQ interrupt and call a completion function intended only for user
CQs, causing a null pointer exception.
Some areas in the driver have prevented this issue with one-off fixes
but did not address the root cause.
This patch fixes the described issue by adding defaults to the create CQ
flow. It adds a default dummy completion function to protect against
null pointer exceptions, and it sets an invalid command sequence number
by default in kernel CQs to prevent the FW from sending an interrupt to
the CQ until it is armed. User CQs are responsible for their own
initialization values.
Callers of mlx5_core_create_cq are responsible for changing the
completion function and arming the CQ per their needs. |
| In the Linux kernel, the following vulnerability has been resolved:
drm/xe/guc: Synchronize Dead CT worker with unbind
Cancel and wait for any Dead CT worker to complete before continuing
with device unbinding. Else the worker will end up using resources freed
by the undind operation.
(cherry picked from commit 492671339114e376aaa38626d637a2751cdef263) |
| In the Linux kernel, the following vulnerability has been resolved:
pmdomain: arm: scmi: Fix genpd leak on provider registration failure
If of_genpd_add_provider_onecell() fails during probe, the previously
created generic power domains are not removed, leading to a memory leak
and potential kernel crash later in genpd_debug_add().
Add proper error handling to unwind the initialized domains before
returning from probe to ensure all resources are correctly released on
failure.
Example crash trace observed without this fix:
| Unable to handle kernel paging request at virtual address fffffffffffffc70
| CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-rc1 #405 PREEMPT
| Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform
| pstate: 00000005 (nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : genpd_debug_add+0x2c/0x160
| lr : genpd_debug_init+0x74/0x98
| Call trace:
| genpd_debug_add+0x2c/0x160 (P)
| genpd_debug_init+0x74/0x98
| do_one_initcall+0xd0/0x2d8
| do_initcall_level+0xa0/0x140
| do_initcalls+0x60/0xa8
| do_basic_setup+0x28/0x40
| kernel_init_freeable+0xe8/0x170
| kernel_init+0x2c/0x140
| ret_from_fork+0x10/0x20 |
| In the Linux kernel, the following vulnerability has been resolved:
sched_ext: Fix unsafe locking in the scx_dump_state()
For built with CONFIG_PREEMPT_RT=y kernels, the dump_lock will be converted
sleepable spinlock and not disable-irq, so the following scenarios occur:
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
irq_work/0/27 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&rq->__lock){?...}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x40
{IN-HARDIRQ-W} state was registered at:
lock_acquire+0x1e1/0x510
_raw_spin_lock_nested+0x42/0x80
raw_spin_rq_lock_nested+0x2b/0x40
sched_tick+0xae/0x7b0
update_process_times+0x14c/0x1b0
tick_periodic+0x62/0x1f0
tick_handle_periodic+0x48/0xf0
timer_interrupt+0x55/0x80
__handle_irq_event_percpu+0x20a/0x5c0
handle_irq_event_percpu+0x18/0xc0
handle_irq_event+0xb5/0x150
handle_level_irq+0x220/0x460
__common_interrupt+0xa2/0x1e0
common_interrupt+0xb0/0xd0
asm_common_interrupt+0x2b/0x40
_raw_spin_unlock_irqrestore+0x45/0x80
__setup_irq+0xc34/0x1a30
request_threaded_irq+0x214/0x2f0
hpet_time_init+0x3e/0x60
x86_late_time_init+0x5b/0xb0
start_kernel+0x308/0x410
x86_64_start_reservations+0x1c/0x30
x86_64_start_kernel+0x96/0xa0
common_startup_64+0x13e/0x148
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&rq->__lock);
<Interrupt>
lock(&rq->__lock);
*** DEADLOCK ***
stack backtrace:
CPU: 0 UID: 0 PID: 27 Comm: irq_work/0
Call Trace:
<TASK>
dump_stack_lvl+0x8c/0xd0
dump_stack+0x14/0x20
print_usage_bug+0x42e/0x690
mark_lock.part.44+0x867/0xa70
? __pfx_mark_lock.part.44+0x10/0x10
? string_nocheck+0x19c/0x310
? number+0x739/0x9f0
? __pfx_string_nocheck+0x10/0x10
? __pfx_check_pointer+0x10/0x10
? kvm_sched_clock_read+0x15/0x30
? sched_clock_noinstr+0xd/0x20
? local_clock_noinstr+0x1c/0xe0
__lock_acquire+0xc4b/0x62b0
? __pfx_format_decode+0x10/0x10
? __pfx_string+0x10/0x10
? __pfx___lock_acquire+0x10/0x10
? __pfx_vsnprintf+0x10/0x10
lock_acquire+0x1e1/0x510
? raw_spin_rq_lock_nested+0x2b/0x40
? __pfx_lock_acquire+0x10/0x10
? dump_line+0x12e/0x270
? raw_spin_rq_lock_nested+0x20/0x40
_raw_spin_lock_nested+0x42/0x80
? raw_spin_rq_lock_nested+0x2b/0x40
raw_spin_rq_lock_nested+0x2b/0x40
scx_dump_state+0x3b3/0x1270
? finish_task_switch+0x27e/0x840
scx_ops_error_irq_workfn+0x67/0x80
irq_work_single+0x113/0x260
irq_work_run_list.part.3+0x44/0x70
run_irq_workd+0x6b/0x90
? __pfx_run_irq_workd+0x10/0x10
smpboot_thread_fn+0x529/0x870
? __pfx_smpboot_thread_fn+0x10/0x10
kthread+0x305/0x3f0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x40/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This commit therefore use rq_lock_irqsave/irqrestore() to replace
rq_lock/unlock() in the scx_dump_state(). |
| In the Linux kernel, the following vulnerability has been resolved:
fs/namespace: fix reference leak in grab_requested_mnt_ns
lookup_mnt_ns() already takes a reference on mnt_ns.
grab_requested_mnt_ns() doesn't need to take an extra reference. |
| In the Linux kernel, the following vulnerability has been resolved:
smb: client: fix memory leak in cifs_construct_tcon()
When having a multiuser mount with domain= specified and using
cifscreds, cifs_set_cifscreds() will end up setting @ctx->domainname,
so it needs to be freed before leaving cifs_construct_tcon().
This fixes the following memory leak reported by kmemleak:
mount.cifs //srv/share /mnt -o domain=ZELDA,multiuser,...
su - testuser
cifscreds add -d ZELDA -u testuser
...
ls /mnt/1
...
umount /mnt
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff8881203c3f08 (size 8):
comm "ls", pid 5060, jiffies 4307222943
hex dump (first 8 bytes):
5a 45 4c 44 41 00 cc cc ZELDA...
backtrace (crc d109a8cf):
__kmalloc_node_track_caller_noprof+0x572/0x710
kstrdup+0x3a/0x70
cifs_sb_tlink+0x1209/0x1770 [cifs]
cifs_get_fattr+0xe1/0xf50 [cifs]
cifs_get_inode_info+0xb5/0x240 [cifs]
cifs_revalidate_dentry_attr+0x2d1/0x470 [cifs]
cifs_getattr+0x28e/0x450 [cifs]
vfs_getattr_nosec+0x126/0x180
vfs_statx+0xf6/0x220
do_statx+0xab/0x110
__x64_sys_statx+0xd5/0x130
do_syscall_64+0xbb/0x380
entry_SYSCALL_64_after_hwframe+0x77/0x7f |
| In the Linux kernel, the following vulnerability has been resolved:
nilfs2: avoid having an active sc_timer before freeing sci
Because kthread_stop did not stop sc_task properly and returned -EINTR,
the sc_timer was not properly closed, ultimately causing the problem [1]
reported by syzbot when freeing sci due to the sc_timer not being closed.
Because the thread sc_task main function nilfs_segctor_thread() returns 0
when it succeeds, when the return value of kthread_stop() is not 0 in
nilfs_segctor_destroy(), we believe that it has not properly closed
sc_timer.
We use timer_shutdown_sync() to sync wait for sc_timer to shutdown, and
set the value of sc_task to NULL under the protection of lock
sc_state_lock, so as to avoid the issue caused by sc_timer not being
properly shutdowned.
[1]
ODEBUG: free active (active state 0) object: 00000000dacb411a object type: timer_list hint: nilfs_construction_timeout
Call trace:
nilfs_segctor_destroy fs/nilfs2/segment.c:2811 [inline]
nilfs_detach_log_writer+0x668/0x8cc fs/nilfs2/segment.c:2877
nilfs_put_super+0x4c/0x12c fs/nilfs2/super.c:509 |
| In the Linux kernel, the following vulnerability has been resolved:
ipv4: route: Prevent rt_bind_exception() from rebinding stale fnhe
The sit driver's packet transmission path calls: sit_tunnel_xmit() ->
update_or_create_fnhe(), which lead to fnhe_remove_oldest() being called
to delete entries exceeding FNHE_RECLAIM_DEPTH+random.
The race window is between fnhe_remove_oldest() selecting fnheX for
deletion and the subsequent kfree_rcu(). During this time, the
concurrent path's __mkroute_output() -> find_exception() can fetch the
soon-to-be-deleted fnheX, and rt_bind_exception() then binds it with a
new dst using a dst_hold(). When the original fnheX is freed via RCU,
the dst reference remains permanently leaked.
CPU 0 CPU 1
__mkroute_output()
find_exception() [fnheX]
update_or_create_fnhe()
fnhe_remove_oldest() [fnheX]
rt_bind_exception() [bind dst]
RCU callback [fnheX freed, dst leak]
This issue manifests as a device reference count leak and a warning in
dmesg when unregistering the net device:
unregister_netdevice: waiting for sitX to become free. Usage count = N
Ido Schimmel provided the simple test validation method [1].
The fix clears 'oldest->fnhe_daddr' before calling fnhe_flush_routes().
Since rt_bind_exception() checks this field, setting it to zero prevents
the stale fnhe from being reused and bound to a new dst just before it
is freed.
[1]
ip netns add ns1
ip -n ns1 link set dev lo up
ip -n ns1 address add 192.0.2.1/32 dev lo
ip -n ns1 link add name dummy1 up type dummy
ip -n ns1 route add 192.0.2.2/32 dev dummy1
ip -n ns1 link add name gretap1 up arp off type gretap \
local 192.0.2.1 remote 192.0.2.2
ip -n ns1 route add 198.51.0.0/16 dev gretap1
taskset -c 0 ip netns exec ns1 mausezahn gretap1 \
-A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
taskset -c 2 ip netns exec ns1 mausezahn gretap1 \
-A 198.51.100.1 -B 198.51.0.0/16 -t udp -p 1000 -c 0 -q &
sleep 10
ip netns pids ns1 | xargs kill
ip netns del ns1 |