Learning kernel debugging through issues discovered in the QEMU environment
Supplementary Material

The DPAS patch applies research published at USENIX FAST '26 to Linux kernel 5.18. Because the research environment (real NVMe SSDs, a specific kernel version) differs from our lab environment (QEMU virtual NVMe, GCC 14), applying the patch as-is can cause build errors and runtime misbehavior.
This document describes three key issues found while porting dpas.patch to the QEMU environment. The symptom → root cause → fix → verification process mirrors real-world kernel debugging workflows.
| Aspect | Research Environment | QEMU Lab Environment |
|---|---|---|
| Storage | Real NVMe SSD | QEMU virtual NVMe (RAM-backed) |
| Timer precision | TSC-based sub-μs | Emulation overhead: several μs |
| I/O scheduler | none | none |
| WBT (Write Back Throttling) | Enabled (wbt_lat_usec=2000) | Unsupported (Invalid argument) |
| GCC version | GCC 11-12 | GCC 14 (Ubuntu 24.04) |
Even after enabling DPAS, no mode transitions occur. switch_stat shows pas io: 0, cp io: 0 — I/O never enters the DPAS state machine.
DPAS determines a statistics bucket based on I/O request size via blk_mq_poll_stats_bkt().
The original code uses blk_rq_stats_sectors(rq), which simply returns rq->stats_sectors.
This field is only populated in blk_mq_start_request() when QUEUE_FLAG_STATS is set:
```c
// block/blk-mq.c — blk_mq_start_request()
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
        rq->io_start_time_ns = ktime_get_ns();
        rq->stats_sectors = blk_rq_sectors(rq);  // only set here
}
```
This flag is activated by components that call blk_stat_add_callback() — for example, WBT when wbt_lat_usec ≠ 0.
In the DPAS research environment, the I/O scheduler is none, but
WBT is active (wbt_lat_usec=2000), which sets QUEUE_FLAG_STATS
and makes blk_rq_stats_sectors() work correctly.
In the QEMU virtual NVMe device, WBT is not supported
(cat wbt_lat_usec → Invalid argument),
so QUEUE_FLAG_STATS is never set and
blk_rq_stats_sectors() always returns 0,
causing bucket = -1 and skipping all DPAS logic.
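The effect can be modeled in a few lines of userspace C. This is a simplified sketch, not the kernel's exact code: poll_stats_bucket and ilog2_u are hypothetical stand-ins for blk_mq_poll_stats_bkt() and ilog2(). The point it illustrates is that a request whose recorded stats_sectors is 0 maps to bucket -1 and is skipped entirely.

```c
#include <assert.h>

/* floor(log2(v)) for v > 0 — stand-in for the kernel's ilog2() */
static int ilog2_u(unsigned int v)
{
    int r = -1;
    while (v) { v >>= 1; r++; }
    return r;
}

/* Simplified model of the size-based bucketing in blk_mq_poll_stats_bkt().
 * ddir is 0 for reads, 1 for writes.  With QUEUE_FLAG_STATS unset,
 * stats_sectors stays 0, so the request lands in no bucket at all. */
static int poll_stats_bucket(unsigned int stats_sectors, int ddir)
{
    if (stats_sectors == 0)
        return -1;              /* no stats recorded: DPAS skips this I/O */
    return ddir + 2 * (ilog2_u(stats_sectors) + 1);
}
```

With stats disabled every request hits the stats_sectors == 0 branch, which is exactly the "pas io: 0, cp io: 0" symptom above.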
WBT (Write Back Throttling, implemented in block/blk-wbt.c) prevents write starvation: when heavy buffered writes saturate a device, read latency spikes dramatically. WBT dynamically limits the number of in-flight write I/Os based on observed latency.
To observe latency, WBT registers a callback via blk_stat_add_callback(), which activates QUEUE_FLAG_STATS and causes every request to record stats_sectors and io_start_time_ns.
wbt_lat_usec is WBT's target latency in microseconds.
When non-zero, WBT is active, and as a side effect, QUEUE_FLAG_STATS is enabled.
Real NVMe SSDs typically have WBT enabled by default, while QEMU virtual NVMe devices do not support it.
```diff
// block/blk-mq.c — blk_mq_poll_stats_bkt()
-       sectors = blk_rq_stats_sectors(rq);
+       sectors = blk_rq_sectors(rq);
```
blk_rq_sectors(rq) always returns the request's sector count regardless of QUEUE_FLAG_STATS.
blk_rq_stats_sectors depends on QUEUE_FLAG_STATS (activated by WBT, I/O schedulers, etc.),
while blk_rq_sectors always works.
To understand why the same code behaves differently across environments,
you must trace back where the return value is actually populated.
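That dependency is easy to reproduce in a userspace sketch. struct rq and the function names below are illustrative stand-ins, not the kernel's types: stats_sectors is only ever a copy of the real sector count, made when the stats flag is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's request and its accessors. */
struct rq {
    unsigned int nr_sectors;     /* always valid */
    unsigned int stats_sectors;  /* valid only if stats were recorded */
};

/* Models the stats path of blk_mq_start_request(): the copy happens
 * only when the queue has QUEUE_FLAG_STATS set. */
static void start_request(struct rq *r, bool queue_flag_stats)
{
    if (queue_flag_stats)
        r->stats_sectors = r->nr_sectors;
}

static unsigned int rq_sectors(const struct rq *r)       { return r->nr_sectors; }
static unsigned int rq_stats_sectors(const struct rq *r) { return r->stats_sectors; }
```

In this model, the QEMU case (flag never set) leaves rq_stats_sectors() at 0 while rq_sectors() still reports the true size — the same divergence the fix exploits.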
After Fix 1, DPAS mode transitions still do not occur. The system stays in MODE[2] (PAS).
By default, the NVMe driver creates no poll-dedicated hardware queues.
Without poll queues, --hipri requests are routed to interrupt queues and polling is impossible.
```
sudo modprobe -r nvme
sudo modprobe nvme poll_queues=2
```
The default io_poll_delay=-1 (BLK_MQ_POLL_CLASSIC) causes blk_mq_poll()
to skip blk_mq_poll_hybrid() entirely.
All DPAS logic — QD tracking, mode evaluation, adaptive sleep — resides in the hybrid polling path.
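The gating can be modeled as follows (a simplified sketch; the real check lives in blk_mq_poll(), and BLK_MQ_POLL_CLASSIC is the kernel's name for the -1 sentinel):

```c
#include <assert.h>
#include <stdbool.h>

#define BLK_MQ_POLL_CLASSIC (-1)

/* Simplified model of the dispatch in blk_mq_poll(): the hybrid path
 * (and with it all DPAS logic) runs only when io_poll_delay >= 0.
 * 0 means adaptive hybrid sleep, > 0 means a fixed sleep in usec. */
static bool uses_hybrid_poll(int io_poll_delay)
{
    return io_poll_delay != BLK_MQ_POLL_CLASSIC;
}
```

With the default -1, every poll spins classically and none of the DPAS code is ever reached, which is why writing 0 to io_poll_delay is mandatory here.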
```
echo 0 | sudo tee /sys/block/nvme0n1/queue/io_poll_delay
```
Both settings are required: dedicated poll queues (poll_queues ≥ 1) and hybrid polling enabled (io_poll_delay ≥ 0).
With Fixes 1 and 2 applied, single-job (numjobs=1) DPAS works correctly.
However, with multi-job (numjobs ≥ 2, same CPU), PAS mode should be maintained
but incorrectly transitions to CP mode.
The hybrid polling sleep loop:
```c
// block/blk-mq.c — blk_mq_poll_hybrid() original
do {
        set_current_state(TASK_UNINTERRUPTIBLE);
        hrtimer_sleeper_start_expires(&hs, mode);  // (1) arm timer
        if (hs.task)                               // (2) check expiry
                io_schedule();                     // (3) yield CPU
        hrtimer_cancel(&hs.timer);
} while (hs.task && !signal_pending(current));
```
On QEMU, the timer arm process (spinlock, rb-tree insertion, ktime_get())
takes several μs. A 100ns timer expires during the arm process itself,
causing hs.task = NULL before step (2).
Result: io_schedule() is skipped, no CPU yield, other jobs cannot run, QD always appears as 1.
| d_init (ns) | No sleep | Slept | Sleep ratio | Verdict |
|---|---|---|---|---|
| 100 | 5,899 | 0 | 0% | Never sleeps |
| 1,000 | 5,998 | 0 | 0% | Never sleeps |
| 2,000 | 10,785 | 15 | 0.1% | Barely sleeps |
| 3,960 | 115 | 55,273 | 99.8% | Normal sleep |
```diff
// block/blk-mq.c — blk_mq_poll_hybrid() modified
do {
        set_current_state(TASK_UNINTERRUPTIBLE);
        hrtimer_sleeper_start_expires(&hs, mode);
-       if (hs.task)
-               io_schedule();
+       io_schedule();  // always yield CPU
        hrtimer_cancel(&hs.timer);
} while (hs.task && !signal_pending(current));
```
When the timer has already expired, wake_up_process() sets the task state to
TASK_RUNNING, so io_schedule() acts as a simple yield and returns immediately.
This change only affects the hybrid polling path and has no impact on the rest of the kernel.
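A toy model (hypothetical functions, not kernel code) makes the one-line change concrete: when the timer fires while still being armed, hs.task is already NULL, so the guarded version never reaches io_schedule(), while the unguarded version still yields.

```c
#include <assert.h>
#include <stdbool.h>

/* timer_expired_during_arm models the QEMU failure mode: the 100 ns
 * timer fires while it is still being armed, clearing hs.task. */

/* Original loop body: returns 1 if the CPU was yielded. */
static int yields_guarded(bool timer_expired_during_arm)
{
    bool hs_task = !timer_expired_during_arm;
    if (hs_task)
        return 1;   /* io_schedule(): CPU yielded */
    return 0;       /* guard skipped the yield */
}

/* Fixed loop body: io_schedule() runs unconditionally.  If the task
 * is already TASK_RUNNING it returns immediately, acting as a yield. */
static int yields_always(bool timer_expired_during_arm)
{
    (void)timer_expired_during_arm;
    return 1;
}
```

On real hardware the two behave identically (the timer rarely expires during arming); only the QEMU case diverges, which is why the fix is low-risk elsewhere.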
The always-yield change also causes additional tf counter increments. With default settings (param1=0), this triggers PAS→OL→INT transitions, effectively degrading DPAS to interrupt mode for multi-job workloads.
On real hardware with adequate hrtimer resolution, this fix may be unnecessary.
Results before the fixes:

| Mode | IOPS (j=1) | Note |
|---|---|---|
| INT | ~18,000 | Interrupt-based |
| CP | ~22,000 | Classic polling |
| PAS | ~22,000 | Same as CP (DPAS inactive) |
| DPAS | ~27,000 | Similar to CP (no transitions) |
Results after the fixes:

| Mode | IOPS (j=1) | Note |
|---|---|---|
| INT | ~22,000 | Interrupt-based (pvsync2, no hipri) |
| CP | ~37,000 | Classic polling (io_poll_delay=-1) |
| PAS | ~31,000 | Adaptive sleep + poll |
| DPAS | ~34,000 | CP:PAS = 10:1 transitions working |
IOPS by job count:

| jobs | INT | CP | PAS | DPAS |
|---|---|---|---|---|
| 1 | 21,879 | 36,633 | 30,628 | 34,250 |
| 2 | 42,950 | 33,955 | 49,494 | 42,565 |
| 4 | 61,739 | 32,908 | 44,379 | 67,800 |
| 8 | 83,633 | 38,007 | 50,194 | 74,615 |
| 16 | 84,183 | 37,433 | 57,368 | ~76,000 |
Debugging lessons:

- Verify each precondition one by one: QUEUE_FLAG_STATS, poll_queues, io_poll_delay.
- Use runtime interfaces where possible: switch_stat and io_poll_delay can be inspected and changed without rebuilding the kernel.
- The final fix touches a single if statement. Kernel changes should have minimal blast radius.

The same source code can behave differently across environments. For hrtimer, a 100ns timer works correctly on real hardware but is effectively nullified on QEMU due to arm overhead. This illustrates the limits of hardware abstraction: kernel code assumes nanosecond timer precision, but this assumption breaks under virtualization.
The always-yield fix solves the QEMU-specific problem but introduces a new side effect (PAS→OL→INT degradation in multi-job scenarios). Kernel modifications can solve one problem while creating another, requiring analysis across the entire state machine.
| Fix | Location | Change Size | Effect |
|---|---|---|---|
| blk_rq_stats_sectors → blk_rq_sectors | blk-mq.c | 1 line | DPAS state machine entry enabled |
| io_poll_delay=0, poll_queues=2 | sysfs config | Runtime | Hybrid polling path activated |
| Remove if (hs.task) | blk-mq.c | 1 line | CPU yield guaranteed on QEMU multi-job |
The three fixes described in this document were developed with the help of an AI coding tool (Claude Code). The AI tool was effective at analyzing kernel source code, inferring causes from symptoms, and proposing fixes. However, there was a case where the AI's explanation was inaccurate, and this section describes how the researcher identified and corrected it.
| Aspect | AI's Assessment | Reality |
|---|---|---|
| Problem identification | Correct — identified that blk_rq_stats_sectors() returns 0, skipping DPAS logic | Correct |
| Proposed fix | Correct — suggested replacing with blk_rq_sectors() | Correct; DPAS worked after the fix |
| Causal explanation | Inaccurate — claimed "I/O schedulers like BFQ/kyber activate the flag; QEMU's none scheduler leaves it inactive" | The research environment also uses the none scheduler; schedulers were not the cause |
Upon receiving the AI's explanation, the researcher raised a critical question: "The research server also uses the none scheduler. If the scheduler were the cause, the same problem should have occurred on native hardware as well."
This question led to direct verification on the actual research server:
```
# Verification on native research server
$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline               ← none scheduler confirmed
$ cat /sys/block/nvme0n1/queue/wbt_lat_usec
2000                             ← WBT is active

# Verification on QEMU environment
$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber         ← same none scheduler
$ cat /sys/block/nvme0n1/queue/wbt_lat_usec
cat: Invalid argument            ← WBT not supported!
```
The actual cause was not the I/O scheduler but WBT (Write Back Throttling).
On the native server, WBT was active and setting QUEUE_FLAG_STATS.
On the QEMU virtual NVMe device, WBT itself was unsupported, leaving the flag unset.
Nevertheless, the AI's proposed fix of replacing blk_rq_stats_sectors with blk_rq_sectors was correct, and it made DPAS actually work.
AI tools are efficient at narrowing down candidate causes by analyzing thousands of lines of kernel source.
The AI noticed in the source that QUEUE_FLAG_STATS is activated by scheduler callbacks, and reasoned that "QEMU uses the none scheduler, so it's inactive." This reasoning was partially correct but missed the key factor (WBT). The code analysis was right, but the AI could not infer the runtime state of a specific environment.
The question "Doesn't the research server also use the none scheduler?" could only be raised by someone who actually knows the experimental setup of that system.
AI tools can read code, but they cannot know the runtime configuration of a specific server.
Correctness of a fix and correctness of its explanation are separate concerns,
and both must be verified.