Minimizing Context Switching Overhead
Most performance debugging starts with the obvious stuff. Is the CPU pegged? Is memory tight? Are we waiting on I/O? And when none of those explain the latency you’re seeing, you start pulling your hair out.
Context switches are one of those things that hide in the gap between “the system looks fine” and “the system is slower than it should be.” They don’t show up in your application metrics. They barely show up in top. But they’re real, and on a busy server they can waste a meaningful chunk of your CPU budget.
The short version
When the kernel switches a core from running one thread to another, it saves and restores register state (and, if the threads belong to different processes, switches page tables). That direct cost is maybe 1 to 3 µs. Fine, whatever.
The expensive part is what happens after. The new thread’s data probably isn’t in L1/L2 cache. Its TLB entries are stale. So it spends its first few hundred microseconds doing memory fetches instead of real work. That adds up fast: at 5,000 switches per second per core, even ~50 µs of effective cost per switch is 250 ms of lost time per core, per second, a quarter of the core gone.
How to check
The go-to advice is perf stat -e context-switches, but on most systems perf_event_paranoid is set to 2 by default and you’ll just get a permissions error. You can either lower it (sudo sysctl kernel.perf_event_paranoid=-1) or run perf as root, but that’s not always an option.
What always works without any special permissions is /proc/<pid>/status:
$ grep ctxt /proc/12345/status
voluntary_ctxt_switches: 182943
nonvoluntary_ctxt_switches: 47832
(Replace 12345 with your actual PID. Use pidof or pgrep to find it.)
To get a rate, sample it twice:
$ grep ctxt /proc/12345/status
voluntary_ctxt_switches: 182943
nonvoluntary_ctxt_switches: 47832
# wait 10 seconds, then run it again
$ grep ctxt /proc/12345/status
voluntary_ctxt_switches: 183209
nonvoluntary_ctxt_switches: 95671
The nonvoluntary switches are the ones that matter here. That’s the kernel yanking your thread off a core to give it to someone else. In this example, (95671 − 47832) / 10 ≈ 4,800 per second. Voluntary switches (your thread blocked on I/O or a lock) are usually fine.
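Sampling by hand works, but it’s easy to script. A minimal sketch in POSIX sh (the PID and interval arguments and their defaults are just for illustration):

```shell
#!/bin/sh
# Measure a process's nonvoluntary context-switch rate by sampling
# /proc/<pid>/status twice. Usage: ctxt-rate.sh [pid] [seconds]
PID=${1:-$$}
INTERVAL=${2:-10}

# Pull the nonvoluntary counter out of the status file.
nonvol() {
  awk '/^nonvoluntary_ctxt_switches/ {print $2}' "/proc/$1/status"
}

before=$(nonvol "$PID")
sleep "$INTERVAL"
after=$(nonvol "$PID")

echo "nonvoluntary switches/sec: $(( (after - before) / INTERVAL ))"
```

Note this is an average over the interval; a bursty service can look fine on a 10-second average and still get hammered during spikes, so sample during peak load.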
If you do have root and perf access:
$ sudo perf stat -e context-switches,cpu-migrations -p 12345 -- sleep 10
The cpu-migrations number is worth watching too. That’s threads hopping between physical cores, which is even worse for cache locality.
Either way, if you’re above 1000 nonvoluntary switches/sec/core on a CPU-bound service, that’s worth looking into.
It’s almost always too many threads
The number one cause I’ve seen is thread pools that are way bigger than the core count. Someone sets it to 64 or 128 “just in case” on a 4-core container and wonders why throughput plateaus.
More threads than cores on CPU-bound work just means the scheduler is rotating threads on and off each core constantly. Every rotation is a context switch. Every context switch trashes the cache.
Size your pool to 1 to 2x your core count. Measure throughput as you adjust. It’s not complicated, it’s just that nobody does it.
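As a sketch of that sizing rule (WORKERS is a hypothetical knob your service would read; nothing here is specific to any framework):

```shell
#!/bin/sh
# Derive a worker-pool size from the core count actually visible to us.
# Caveat: nproc honors the CPU affinity mask, NOT cgroup CPU quotas, so in
# a quota-limited container you may need to clamp this further by hand.
CORES=$(nproc)
WORKERS=$(( CORES * 2 ))   # 1-2x cores for CPU-bound work; measure as you adjust

echo "cores=$CORES workers=$WORKERS"
```

Export that as an environment variable, point your pool config at it, and then actually measure throughput at 1x, 1.5x, and 2x before settling.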
Stop reading outdated scheduler advice
This is the part that actually motivated this post. Half the articles out there about Linux scheduling tell you to tune sched_latency_ns and sched_min_granularity_ns under /proc/sys/kernel/. Those don’t exist anymore. They moved to debugfs in kernel 5.13, and then kernel 6.6 replaced CFS (the Completely Fair Scheduler) entirely with EEVDF (Earliest Eligible Virtual Deadline First).
If you’re on any distro released in 2024 or 2025 you’re almost certainly on EEVDF. Check for yourself:
$ ls /proc/sys/kernel/sched_*
/proc/sys/kernel/sched_autogroup_enabled
/proc/sys/kernel/sched_cfs_bandwidth_slice_us
/proc/sys/kernel/sched_deadline_period_max_us
/proc/sys/kernel/sched_deadline_period_min_us
/proc/sys/kernel/sched_energy_aware
/proc/sys/kernel/sched_rr_timeslice_ms
/proc/sys/kernel/sched_rt_period_us
/proc/sys/kernel/sched_rt_runtime_us
/proc/sys/kernel/sched_schedstats
/proc/sys/kernel/sched_util_clamp_max
/proc/sys/kernel/sched_util_clamp_min
/proc/sys/kernel/sched_util_clamp_min_rt_default
No sched_latency_ns. No sched_min_granularity_ns. Gone.
EEVDF tracks each task’s “lag” (how much CPU time it’s owed vs. what it got) and assigns virtual deadlines. Tasks that are behind get earlier deadlines. It’s less heuristic-driven than CFS was, which is why most of the old knobs got removed.
The one tunable that survived is base_slice_ns under debugfs:
$ cat /sys/kernel/debug/sched/base_slice_ns
3000000
Bigger value = longer time slices = fewer switches but worse latency for other tasks. You can bump it for backend services, but honestly you probably don’t need to. Fix your thread count first.
Kubernetes defaults are bad for this
The default CPU manager policy is none, meaning all pods share all cores. Your threads get bounced between cores freely, which means both context switches and CPU migrations (cold caches on a completely different core).
For CPU-bound, latency-sensitive stuff, use the static policy:
# kubelet config
cpuManagerPolicy: static
kubeReserved:
  cpu: "1"
systemReserved:
  cpu: "1"
Pods in the Guaranteed QoS class (requests equal to limits for every resource) whose containers request a whole number of CPUs get pinned to exclusive cores. No sharing, no migrations. You do need to reserve cores for system stuff or the node gets unhappy.
Only Guaranteed pods get this treatment. Everything else still shares.
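For reference, a minimal sketch of a pod spec that would qualify (the name and image are placeholders). Requests must equal limits for memory too, not just CPU, and the CPU value must be an integer:

```yaml
# Illustrative pod: integer CPU request == limit, memory request == limit
# => Guaranteed QoS, pinned to exclusive cores under the static policy.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-worker          # placeholder name
spec:
  containers:
  - name: app
    image: example.com/app:latest   # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 2Gi
      limits:
        cpu: "4"
        memory: 2Gi
```

A fractional request like cpu: "3.5" would still be Guaranteed QoS but would not get exclusive cores.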
Check your cgroup throttling too
Cgroup CPU quotas can cause context switches even if your thread count is fine. When you hit quota, the kernel suspends all your threads until the next period. When they come back, caches are cold.
$ cat /sys/fs/cgroup/cpu.stat
usage_usec 11565043278
user_usec 8040824681
system_usec 3524218596
nice_usec 5061938
core_sched.force_idle_usec 0
nr_periods 0
nr_throttled 0
throttled_usec 0
nr_bursts 0
burst_usec 0
nr_throttled vs nr_periods is the ratio you want. If it’s significant, your quota is too tight. Both being 0 (like above) means bandwidth control isn’t limiting you.
On older cgroup v1 systems the file is /sys/fs/cgroup/cpu/cpu.stat with slightly different field names (throttled_time instead of throttled_usec).
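The ratio is easy to compute in one shot against the v2 file; a sketch with awk:

```shell
#!/bin/sh
# Report what fraction of cgroup bandwidth periods hit the CPU quota
# (cgroup v2 path; adjust the path for v1 or for a specific cgroup).
awk '/^nr_periods/   {p = $2}
     /^nr_throttled/ {t = $2}
     END {
       if (p > 0) printf "throttled in %.1f%% of periods\n", 100 * t / p
       else       print  "bandwidth control inactive (nr_periods is 0)"
     }' /sys/fs/cgroup/cpu.stat
```

Anything in the double digits means your threads are regularly being frozen mid-work and coming back to cold caches; raise the quota or cut the thread count.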
That’s basically it
Grab a PID, grep ctxt /proc/<pid>/status, wait a few seconds, do it again. If nonvoluntary switches are high, shrink your thread pool. If you’re on Kubernetes, consider static CPU policy. Check cpu.stat for throttling.
None of this is hard. The hard part is remembering to look.