dpas.patch Improvement Process

Learning kernel debugging through issues discovered in the QEMU environment

Supplementary Material

1. Overview: Applying Research Code in Practice

The DPAS patch applies research published at USENIX FAST '26 to Linux kernel 5.18. Because the research environment (real NVMe SSDs, a specific kernel version) differs from our lab environment (QEMU virtual NVMe, GCC 14), applying the patch as-is can cause build errors and runtime misbehavior.

This document describes three key issues found while porting dpas.patch to the QEMU environment. The symptom → root cause → fix → verification process mirrors real-world kernel debugging workflows.

Research vs. Lab Environment
Aspect | Research Environment | QEMU Lab Environment
Storage | Real NVMe SSD | QEMU virtual NVMe (RAM-backed)
Timer precision | TSC-based sub-μs | Emulation overhead: several μs
I/O scheduler | none | none
WBT (Write Back Throttling) | Enabled (wbt_lat_usec=2000) | Unsupported (Invalid argument)
GCC version | GCC 11-12 | GCC 14 (Ubuntu 24.04)

2. Fix 1: blk_rq_stats_sectors → blk_rq_sectors

Symptom

Even after enabling DPAS, no mode transitions occur. switch_stat shows pas io: 0, cp io: 0 — I/O never enters the DPAS state machine.

Root Cause

DPAS determines a statistics bucket based on I/O request size via blk_mq_poll_stats_bkt(). The original code uses blk_rq_stats_sectors(rq), which simply returns rq->stats_sectors. This field is only populated in blk_mq_start_request() when QUEUE_FLAG_STATS is set:

// block/blk-mq.c — blk_mq_start_request()
if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
    rq->io_start_time_ns = ktime_get_ns();
    rq->stats_sectors = blk_rq_sectors(rq);  // only set here
}

What enables QUEUE_FLAG_STATS?

This flag is activated by components that call blk_stat_add_callback(), such as WBT and latency-tracking I/O schedulers:

In the DPAS research environment, the I/O scheduler is none, but WBT is active (wbt_lat_usec=2000), which sets QUEUE_FLAG_STATS and makes blk_rq_stats_sectors() work correctly.

In the QEMU virtual NVMe device, WBT is not supported (reading wbt_lat_usec fails with Invalid argument), so QUEUE_FLAG_STATS is never set and blk_rq_stats_sectors() always returns 0. The bucket computation then yields -1, and all DPAS logic is skipped.
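The failure mode can be illustrated with a small user-space sketch of the size-to-bucket mapping. The function below is a simplified stand-in for blk_mq_poll_stats_bkt(), not the kernel's exact code; the bucket formula is an illustrative assumption:

```python
import math

def poll_stats_bucket(sectors: int, ddir: int = 0) -> int:
    """Toy model of blk_mq_poll_stats_bkt(): map request size to a stats bucket.

    Returns -1 for a zero size, which is what happens when QUEUE_FLAG_STATS
    is unset and rq->stats_sectors is never populated.
    """
    if sectors <= 0:
        return -1                                  # no stats -> DPAS logic skipped
    return ddir + 2 * (int(math.log2(sectors)) + 1)

# QEMU without WBT: stats_sectors was never populated
print(poll_stats_bucket(0))    # -1
# After switching to blk_rq_sectors(): a 4 KiB read (8 sectors)
print(poll_stats_bucket(8))    # 8
```

The key behavior is the first branch: with stats_sectors stuck at 0, every request hits the -1 path and DPAS never sees a single I/O.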

What is WBT (Write Back Throttling)?

WBT Overview WBT is a buffered-write throttling mechanism in the Linux block layer (block/blk-wbt.c). It prevents reads from being starved by writes: when heavy buffered writes saturate a device, read latency spikes dramatically. WBT dynamically limits the number of in-flight write I/Os based on observed latency.

To function, WBT must track completion latency of every I/O request. It does this by calling blk_stat_add_callback(), which activates QUEUE_FLAG_STATS and causes every request to record stats_sectors and io_start_time_ns.

wbt_lat_usec is WBT's target latency in microseconds. When non-zero, WBT is active, and as a side effect, QUEUE_FLAG_STATS is enabled. Real NVMe SSDs typically have WBT enabled by default, while QEMU virtual NVMe devices do not support it.
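As a rough conceptual model of this feedback loop (not blk-wbt's actual scaling algorithm; the function name, step sizes, and bounds here are illustrative assumptions):

```python
def wbt_adjust_window(inflight_limit: int, observed_lat_us: int,
                      target_lat_us: int = 2000,
                      lo: int = 1, hi: int = 64) -> int:
    """Toy model of WBT's feedback loop: shrink the allowed in-flight
    write window when completion latency exceeds the target, and grow
    it back slowly once latency is under the target."""
    if observed_lat_us > target_lat_us:
        return max(lo, inflight_limit // 2)   # back off aggressively
    return min(hi, inflight_limit + 1)        # probe upward gently

limit = 32
limit = wbt_adjust_window(limit, observed_lat_us=5000)  # latency spike: halved
limit = wbt_adjust_window(limit, observed_lat_us=800)   # recovered: +1
print(limit)
```

The point is that this loop needs per-request completion latency, which is why enabling WBT drags QUEUE_FLAG_STATS along with it.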

Fix

// block/blk-mq.c — blk_mq_poll_stats_bkt()
- sectors = blk_rq_stats_sectors(rq);
+ sectors = blk_rq_sectors(rq);

blk_rq_sectors(rq) always returns the request's sector count regardless of QUEUE_FLAG_STATS.

Lesson The kernel contains functions with similar names but different preconditions. blk_rq_stats_sectors depends on QUEUE_FLAG_STATS (activated by WBT, I/O schedulers, etc.), while blk_rq_sectors always works. To understand why the same code behaves differently across environments, you must trace back where the return value is actually populated.

3. Fix 2: NVMe poll_queues and io_poll_delay

Symptom

After Fix 1, DPAS mode transitions still do not occur. The system stays in MODE[2] (PAS).

Root Cause: Two Missing Prerequisites

(a) poll_queues=0 — No poll queues exist

By default, the NVMe driver creates no poll-dedicated hardware queues. Without poll queues, --hipri requests are routed to interrupt queues and polling is impossible.

sudo modprobe -r nvme
sudo modprobe nvme poll_queues=2

(b) io_poll_delay=-1 — Hybrid polling disabled

The default io_poll_delay=-1 (BLK_MQ_POLL_CLASSIC) causes blk_mq_poll() to skip blk_mq_poll_hybrid() entirely. All DPAS logic — QD tracking, mode evaluation, adaptive sleep — resides in the hybrid polling path.

echo 0 | sudo tee /sys/block/nvme0n1/queue/io_poll_delay

Key Point DPAS code path entry requires two prerequisites:
  1. NVMe poll queues exist (poll_queues ≥ 1)
  2. Hybrid polling enabled (io_poll_delay ≥ 0)
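The two gates above can be sketched as a single predicate (parameter names mirror the module and sysfs knobs; the kernel's real checks are spread across the NVMe driver and blk_mq_poll()):

```python
def dpas_path_reachable(poll_queues: int, io_poll_delay: int) -> bool:
    """True only when a --hipri request can reach the hybrid-polling
    (and hence DPAS) code path."""
    if poll_queues < 1:
        return False   # no poll HW queues: request lands on an IRQ queue
    if io_poll_delay < 0:
        return False   # BLK_MQ_POLL_CLASSIC: blk_mq_poll_hybrid() is skipped
    return True

assert not dpas_path_reachable(poll_queues=0, io_poll_delay=0)
assert not dpas_path_reachable(poll_queues=2, io_poll_delay=-1)
assert dpas_path_reachable(poll_queues=2, io_poll_delay=0)
```

Both conditions must hold; fixing only one of them (as in the symptom above) leaves the system stuck outside the DPAS state machine.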

4. Fix 3: hrtimer Resolution and the Sleep Loop

Symptom

With Fixes 1 and 2 applied, single-job (numjobs=1) DPAS works correctly. However, with multi-job (numjobs ≥ 2, same CPU), PAS mode should be maintained but incorrectly transitions to CP mode.

Root Cause

The hybrid polling sleep loop:

// block/blk-mq.c — blk_mq_poll_hybrid() original
do {
    set_current_state(TASK_UNINTERRUPTIBLE);
    hrtimer_sleeper_start_expires(&hs, mode);   // (1) arm timer
    if (hs.task)                                // (2) check expiry
        io_schedule();                          // (3) yield CPU
    hrtimer_cancel(&hs.timer);
} while (hs.task && !signal_pending(current));

On QEMU, the timer arm process (spinlock, rb-tree insertion, ktime_get()) takes several μs. A 100ns timer expires during the arm process itself, causing hs.task = NULL before step (2). Result: io_schedule() is skipped, no CPU yield, other jobs cannot run, QD always appears as 1.

Measured hrtimer Resolution on QEMU

d_init (ns) | No-sleep count | Slept count | Sleep ratio | Verdict
100 | 5,899 | 0 | 0% | Never sleeps
1,000 | 5,998 | 0 | 0% | Never sleeps
2,000 | 10,785 | 15 | 0.1% | Barely sleeps
3,960 | 115 | 55,273 | 99.8% | Normal sleep
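The cliff in the measurements can be reproduced with a toy model: a sleep succeeds only when the programmed delay outlives the arm path itself. The ~3 μs arm overhead below is an assumption chosen to match the measurements, not a measured constant:

```python
ARM_OVERHEAD_NS = 3000   # assumed QEMU cost of spinlock + rb-tree insert + ktime_get()

def timer_survives_arming(d_init_ns: int) -> bool:
    """A timer shorter than the arm path expires before step (2) of the
    sleep loop, so hs.task is already NULL and io_schedule() is skipped."""
    return d_init_ns > ARM_OVERHEAD_NS

for d in (100, 1_000, 2_000, 3_960):
    print(d, timer_survives_arming(d))
```

Only the 3,960 ns delay clears the assumed overhead, matching the one row in the table that actually sleeps.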

Fix: Always-Yield

// block/blk-mq.c — blk_mq_poll_hybrid() modified
do {
    set_current_state(TASK_UNINTERRUPTIBLE);
    hrtimer_sleeper_start_expires(&hs, mode);
-   if (hs.task)
-       io_schedule();
+   io_schedule();  // always yield CPU
    hrtimer_cancel(&hs.timer);
} while (hs.task && !signal_pending(current));

When the timer has already expired, wake_up_process() sets the task state to TASK_RUNNING, so io_schedule() acts as a simple yield and returns immediately. This change only affects the hybrid polling path and has no impact on the rest of the kernel.

Limitation of Always-Yield While this fix ensures CPU yield in multi-job scenarios, it has a side effect: I/O may complete during the yield, causing "overslept" detection and tf counter increment. With default settings (param1=0), this triggers PAS→OL→INT transitions, effectively degrading DPAS to interrupt mode for multi-job workloads. On real hardware with adequate hrtimer resolution, this fix may be unnecessary.
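The degradation described above can be sketched as a tiny state machine. The transition rule ("one overslept event per step when param1=0") is an assumption based on the text, not the patch's exact policy, and the helper name is hypothetical:

```python
def next_mode(mode: str, overslept: bool, tf: int, param1: int = 0):
    """Hypothetical sketch: each overslept detection bumps the tf counter;
    once tf exceeds param1, the mode degrades one step PAS -> OL -> INT."""
    degrade = {"PAS": "OL", "OL": "INT"}
    if overslept:
        tf += 1
        if tf > param1 and mode in degrade:
            return degrade[mode], 0   # degrade and reset the counter
    return mode, tf

mode, tf = "PAS", 0
for _ in range(2):                    # two overslept completions in a row
    mode, tf = next_mode(mode, overslept=True, tf=tf)
print(mode)   # INT: DPAS has effectively fallen back to interrupt mode
```

With param1=0, a single overslept completion per step is enough to walk the machine all the way down, which is why multi-job DPAS ends up behaving like INT.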

5. Benchmark Comparison: Before and After

Before Fixes

Mode | IOPS (j=1) | Note
INT | ~18,000 | Interrupt-based
CP | ~22,000 | Classic polling
PAS | ~22,000 | Same as CP (DPAS inactive)
DPAS | ~27,000 | Similar to CP (no transitions)

After All Fixes

Mode | IOPS (j=1) | Note
INT | ~22,000 | Interrupt-based (pvsync2, no hipri)
CP | ~37,000 | Classic polling (io_poll_delay=-1)
PAS | ~31,000 | Adaptive sleep + poll
DPAS | ~34,000 | CP:PAS = 10:1 transitions working

Multi-Job Performance (3-run avg, CPU 0 pinned)

jobs | INT | CP | PAS | DPAS
1 | 21,879 | 36,633 | 30,628 | 34,250
2 | 42,950 | 33,955 | 49,494 | 42,565
4 | 61,739 | 32,908 | 44,379 | 67,800
8 | 83,633 | 38,007 | 50,194 | 74,615
16 | 84,183 | 37,433 | 57,368 | ~76,000
Key Observations
  • CP: Busy-poll monopolizes CPU. No multi-job scaling (~37K fixed).
  • INT: Interrupt-based, natural CPU sharing. Near-linear multi-job scaling.
  • PAS: Sleep-poll hybrid, better scaling than CP.
  • DPAS j=1: CP:PAS = 10:1 mode transitions. Slightly lower than pure CP due to PAS sleep overhead.
  • DPAS j≥2: Always-yield side effect causes PAS→OL→INT, effectively becoming INT mode.

6. Lessons Learned

Kernel Debugging Methodology

  1. Describe symptoms precisely — Not "transitions don't work" but "switch_stat shows pas io=0, bucket returns -1."
  2. Trace the code path — I/O submit (fops.c) → poll entry (blk-mq.c) → sleep loop → completion. Identify where it stalls.
  3. Verify preconditions — Check QUEUE_FLAG_STATS, poll_queues, io_poll_delay one by one.
  4. Use sysfs — Inspect runtime state via switch_stat, io_poll_delay without rebuilding the kernel.
  5. Minimize changes — Fix 1 changes one function call, Fix 3 removes one if statement. Kernel changes should have minimal blast radius.

Environment Dependence of Systems Software

The same source code can behave differently across environments. For hrtimer, a 100ns timer works correctly on real hardware but is effectively nullified on QEMU due to arm overhead. This illustrates the limits of hardware abstraction: kernel code assumes nanosecond timer precision, but this assumption breaks under virtualization.

Trade-off Analysis

The always-yield fix solves the QEMU-specific problem but introduces a new side effect (PAS→OL→INT degradation in multi-job scenarios). Kernel modifications can solve one problem while creating another, requiring analysis across the entire state machine.

Summary
Fix | Location | Change Size | Effect
blk_rq_stats_sectors → blk_rq_sectors | blk-mq.c | 1 line | DPAS state machine entry enabled
io_poll_delay=0, poll_queues=2 | sysfs config | Runtime | Hybrid polling path activated
Remove if (hs.task) | blk-mq.c | 1 line | CPU yield guaranteed on QEMU multi-job

7. Case Study: Proper Use of AI Tools

The three fixes described in this document were developed with the help of an AI coding tool (Claude Code). The AI tool was effective at analyzing kernel source code, inferring causes from symptoms, and proposing fixes. However, there was a case where the AI's explanation was inaccurate, and this section describes how the researcher identified and corrected it.

What Was Right, What Was Wrong

Aspect | AI's Assessment | Reality
Problem identification | Correct: identified that blk_rq_stats_sectors() returns 0, skipping DPAS logic | Correct
Proposed fix | Correct: suggested replacing with blk_rq_sectors() | Correct; DPAS worked after the fix
Causal explanation | Inaccurate: claimed "I/O schedulers like BFQ/kyber activate the flag; QEMU's none scheduler leaves it inactive" | The research environment also uses the none scheduler; schedulers were not the cause

The Researcher's Verification

Upon receiving the AI's explanation, the researcher raised a critical question:

"The native research environment also uses the default none scheduler. If the scheduler were the cause, the same problem should have occurred on native hardware as well."

This question led to direct verification on the actual research server:

# Verification on native research server
$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline        ← none scheduler confirmed

$ cat /sys/block/nvme0n1/queue/wbt_lat_usec
2000                           ← WBT is active

# Verification on QEMU environment
$ cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber  ← same none scheduler

$ cat /sys/block/nvme0n1/queue/wbt_lat_usec
cat: Invalid argument        ← WBT not supported!

The actual cause was not the I/O scheduler but WBT (Write Back Throttling). On the native server, WBT was active and setting QUEUE_FLAG_STATS. On the QEMU virtual NVMe device, WBT itself was unsupported, leaving the flag unset.

What This Case Illustrates

  1. AI can quickly find "working fixes" — the replacement of blk_rq_stats_sectors with blk_rq_sectors was correct, and it made DPAS actually work. AI tools are efficient at narrowing down candidate causes by analyzing thousands of lines of kernel source.
  2. AI's "explanations" require verification — the AI correctly identified that QUEUE_FLAG_STATS is activated by scheduler callbacks, and reasoned that "QEMU uses none scheduler, so it's inactive." This reasoning was partially correct but missed the key factor (WBT). The code analysis was right, but the AI could not infer the runtime state of a specific environment.
  3. Final verification is the researcher's responsibility — the question "but native also uses none scheduler?" could only be raised by someone who actually knows the experimental setup of that system. AI tools can read code, but they cannot know the runtime configuration of a specific server. Correctness of a fix and correctness of its explanation are separate concerns, and both must be verified.

Principles for Using AI Tools in Systems Research
  • Use AI as a tool, but make judgments yourself — before applying an AI-suggested fix, ensure you understand why the fix is necessary.
  • Question the explanation even when the fix works — "it's fixed" and "the cause is correctly understood" are different things. A working fix can be based on incorrect reasoning.
  • Verify environment-dependent facts directly — AI can analyze code, but runtime states of specific hardware, driver implementation differences, and experimental configurations must be verified in person.