Part 4: DPAS Lab

Hands-on Performance Comparison of NVMe I/O Completion Modes

2026-1 Systems Technology

Prerequisites

You must have already completed the platform-specific QEMU installation and Ubuntu VM setup.

Required Environment: Ubuntu 22.04 VM + mainline kernel 5.18 (uname -r prints 5.18.0-051800-generic)
Why Kernel 5.18? This lab uses synchronous Direct I/O polling via preadv2(RWF_HIPRI). Linux 5.19 removed this feature, so 5.18 is the last kernel that supports it. Because the kernel that matters is the one running inside the QEMU guest VM (5.18), the polling lab works regardless of the host OS kernel version.
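A quick way to confirm that a guest kernel is inside the supported range is to compare the major and minor fields of uname -r against 5.18. The sketch below assumes the usual MAJOR.MINOR.PATCH-suffix release format. The function name hipri_poll_supported is hypothetical, and it encodes only the upper bound stated above (it does not check how old a kernel may be).

```shell
# Hypothetical helper: succeeds when the release string is 5.18 or older,
# i.e. before synchronous RWF_HIPRI polling was removed (5.19 per the note above).
hipri_poll_supported() {
    ver=${1%%-*}          # "5.18.0-051800-generic" -> "5.18.0"
    major=${ver%%.*}      # "5"
    rest=${ver#*.}        # "18.0"
    minor=${rest%%.*}     # "18"
    [ "$major" -lt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -le 18 ]; }
}

# Run inside the guest VM
if hipri_poll_supported "$(uname -r)"; then
    echo "kernel $(uname -r): synchronous polling available"
else
    echo "kernel $(uname -r): too new, boot the 5.18 kernel"
fi
```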

I/O Completion Background Reading: history of interrupts and polling, the latest FAST '26 research, and the future of io_uring and polling
dpas.patch Improvement Process: three issues discovered in QEMU and the kernel debugging process

Part 1: DPAS Kernel Installation

Install the DPAS-patched kernel using the provided deb packages.

What is a deb package? A .deb file is a software installation package used in Ubuntu/Debian-based Linux distributions. It serves the same role as .msi installers on Windows or .pkg on macOS. Install with sudo dpkg -i file.deb, which automatically places system files such as kernel images and headers in the correct locations.
Architecture Selection (based on CPU architecture, not host OS): kernel packages are selected by the CPU architecture the VM runs on, regardless of the host operating system (macOS/Windows/Linux). Run uname -m inside the guest VM to check the architecture.
Host Environment | uname -m Output | Required Package
Apple Silicon Mac (M1/M2/M3/M4) | aarch64 | ARM64
Intel Mac / Linux PC / Windows (WSL2) | x86_64 | x86_64
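The lookup in the table above can be written as a small shell function. This is only a sketch: pkg_for_arch is a hypothetical helper name, and it handles just the two architectures the lab provides packages for.

```shell
# Hypothetical helper: map a `uname -m` string to the package column of the table.
pkg_for_arch() {
    case "$1" in
        aarch64) echo "ARM64" ;;
        x86_64)  echo "x86_64" ;;
        *)       echo "unsupported" ;;   # no package is provided for other CPUs
    esac
}

# Run inside the guest VM
echo "Download the $(pkg_for_arch "$(uname -m)") packages"
```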

Download Kernel Packages

Download the package matching your VM's CPU architecture.

ARM64 Kernel Image (23MB) ARM64 Headers (8MB)
x86_64 Kernel Image (12MB) x86_64 Headers (8MB)

Advanced: Build the kernel yourself

Download dpas.patch

1-1. Transfer Packages

Transfer the downloaded deb files from the host to the VM.

# Run on the host terminal
scp -P 2222 linux-image-5.18.0-dpas_*.deb <username>@localhost:~/
scp -P 2222 linux-headers-5.18.0-dpas_*.deb <username>@localhost:~/

1-2. Install and Reboot

# Run inside the VM
sudo dpkg -i ~/linux-image-5.18.0-dpas_*.deb ~/linux-headers-5.18.0-dpas_*.deb
sudo reboot

1-3. Verify Installation

uname -r
# 5.18.0-dpas
If the boot kernel did not change: GRUB may have booted the previous kernel. Select the DPAS kernel with the following command and reboot:
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.18.0-dpas"
sudo reboot
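The grub-reboot argument has to match the GRUB menu titles exactly, so it can help to list them first. The sketch below assumes Ubuntu's generated /boot/grub/grub.cfg, where each menuentry or submenu title is wrapped in single quotes. list_grub_entries is a hypothetical helper name.

```shell
# Hypothetical helper: print the title of every menuentry/submenu line
# in a grub.cfg file (titles are the first single-quoted field).
list_grub_entries() {
    awk -F"'" '/^[ \t]*(menuentry|submenu) /{ print $2 }' "$1"
}

# Run inside the VM; the file is normally world-readable on Ubuntu
if [ -r /boot/grub/grub.cfg ]; then
    list_grub_entries /boot/grub/grub.cfg
fi
```

A submenu title and an entry title joined with ">" form the string passed to grub-reboot, as in the command above.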

Part 2: Lab — Performance Comparison by Mode

2-1. Benchmark Script

Use the unified benchmark script bench.sh. It runs any selection of the five I/O completion modes and, for each mode, measures IOPS, latency, CPU usage, and context switches at job counts of 1, 2, 4, 8, and 16.

Download bench.sh

Supported Modes

Mode | Description | io_poll_delay | pas_enabled | hipri
INT | Interrupt completion (no polling) | -1 | 0 | No
CP | Classic Polling (busy-poll) | -1 | 0 | Yes
LHP | Linux Hybrid Polling (adaptive sleep) | 0 | 0 | Yes
PAS | PAS (poll-after-sleep, adaptive) | 0 | 1 | Yes
DPAS | Dynamic PAS (auto mode switching) | 0 | 1 | Yes
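To make the table concrete, the sketch below returns the three listed settings for a given mode. It is not taken from bench.sh: mode_settings is a hypothetical helper, and it reproduces only the columns shown above (the table lists identical values for PAS and DPAS, so any further difference between them is not modeled here).

```shell
# Hypothetical helper: print the table columns for one mode.
# hipri is written as 1/0 for the table's Yes/No.
mode_settings() {
    case "$1" in
        INT)      echo "io_poll_delay=-1 pas_enabled=0 hipri=0" ;;
        CP)       echo "io_poll_delay=-1 pas_enabled=0 hipri=1" ;;
        LHP)      echo "io_poll_delay=0 pas_enabled=0 hipri=1" ;;
        PAS|DPAS) echo "io_poll_delay=0 pas_enabled=1 hipri=1" ;;
        *)        echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

for m in INT CP LHP PAS DPAS; do
    printf '%-5s %s\n' "$m" "$(mode_settings "$m")"
done
```

On the DPAS kernel, the first two values would presumably be written to the files of the same name under /sys/block/nvme0n1/queue/, while hipri is a per-request flag chosen by the benchmark tool rather than a sysfs setting.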


Usage

# Transfer to VM
scp -P 2222 bench.sh <username>@localhost:~/

# Run inside VM
chmod +x bench.sh

# Syntax: sudo bash bench.sh [modes...] [R|W] [repeat_count]
sudo bash bench.sh ALL                  # All modes, randread, 1 run
sudo bash bench.sh INT CP DPAS W 3      # 3 modes, randwrite, 3 runs
sudo bash bench.sh DPAS 2               # DPAS only, randread, 2 runs
sudo bash bench.sh ALL W 2              # All modes, randwrite, 2 runs

# Arguments can be in any order:
sudo bash bench.sh INT 2 R CP           # same as: CP INT R 2
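Order-independent arguments work because each token can be recognized by its form alone: a mode name, the letter R or W, or a number. The following is a minimal sketch of that idea, not the actual bench.sh parser. classify_arg is a hypothetical name.

```shell
# Hypothetical helper: classify one command-line token by its shape.
classify_arg() {
    case "$1" in
        INT|CP|LHP|PAS|DPAS|ALL) echo mode ;;
        R|W)                     echo rw ;;
        ''|*[!0-9]*)             echo unknown ;;   # empty or contains a non-digit
        *)                       echo repeat ;;    # digits only
    esac
}

for a in INT 2 R CP; do
    printf '%s -> %s\n' "$a" "$(classify_arg "$a")"
done
```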

Sample Output

=== INT ===
Jobs         IOPS       lat_us     cpu%        ctx
----       ------       ------    -----     ------
1            7253        137.3      0.2     145087
2           16294        122.2      0.8     250851
...

=== DPAS ===
Jobs         IOPS       lat_us     cpu%        ctx   breakdown
----       ------       ------    -----     ------   ----------
1           18694         53.1     94.7      18058   cp:170035 pas:17100 ol:0 int:0
2           24119         82.1      2.3     398800   cp:1000 pas:800 ol:2801 int:236614
...

2-2. Recording Results

Record the IOPS from the script output.

Mode | j=1 | j=2 | j=4 | j=8 | j=16
INT | | | | |
CP | | | | |
LHP | | | | |
PAS | | | | |
DPAS | | | | |

2-3. Verifying DPAS Mode Transition

Check the DPAS breakdown in the bench.sh output:

MODE Value | Meaning
0 | INT (Interrupt)
1 | CP (Continuous Polling)
2 | PAS (Initial State)
3 | OL (Overloaded)

Expected Behavior

Jobs | Expected Final Mode | Reason
1 | CP (MODE 1) | QD=1 → PAS→CP transition, lowest latency via polling
2+ | INT (MODE 0) | QD>1 → PAS→OL→INT transition, yields CPU via interrupts
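One way to compare a run against the expected behavior is to find the largest counter in the breakdown column. The sketch below assumes the "name:count" format shown in the sample output and treats the dominant counter as an approximation of the final mode. dominant_mode is a hypothetical helper, not part of bench.sh.

```shell
# Hypothetical helper: print the name of the largest counter in a
# breakdown string such as "cp:170035 pas:17100 ol:0 int:0".
dominant_mode() {
    echo "$1" | tr ' ' '\n' | awk -F: '
        $2 + 0 > max { max = $2 + 0; name = $1 }
        END { print name }'
}

dominant_mode "cp:170035 pas:17100 ol:0 int:0"       # j=1 row of the sample output
dominant_mode "cp:1000 pas:800 ol:2801 int:236614"   # j=2 row of the sample output
```

For the two sample rows this prints cp and int, which matches the expected final modes in the table.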

2-4. Analysis Questions

  1. What causes the IOPS difference between INT and CP modes when j=1?
  2. Why does IOPS not increase significantly when adding more jobs in CP mode? (Consider CPU usage)
  3. Explain why DPAS approaches CP performance at j=1 and approaches INT performance at j≥4 from the perspective of mode transition.
  4. Why is the pas io count always less than the polled io count in the DPAS breakdown?
  5. Explain how the d_init value affects DPAS mode transitions.

Part 3: DPAS Kernel Internals

3-0. IO Path and DPAS Intervention Point

When fio submits IO, it traverses the kernel block layer to reach the NVMe driver. DPAS controls the completion waiting strategy in the polling path.

/* Userspace */
fio (pvsync2 --hipri)
  └─ preadv2(RWF_HIPRI)                   /* polling request flag */

/* Kernel: IO submission (block/fops.c) */
  └─ blk_mq_submit_bio()                  /* block layer entry */
       ├─ ★ INT→OL transition (int_cnt ≥ 10000) /* re-evaluate at submit */
       └─ nvme_queue_rq()                   /* dispatch to NVMe driver */

/* Kernel: IO completion wait (block/blk-mq.c) */
  └─ blk_mq_poll()                        /* polling entry point */
       └─ blk_mq_poll_hybrid()             /* ★ DPAS core: mode decision */
            ├─ blk_mq_poll_pas_nsecs()      /* adaptive sleep calc (dur, tf) */
            ├─ mode transition logic         /* CP↔PAS→OL→INT transitions */
            └─ hrtimer_nanosleep()           /* PAS: sleep then poll */
               or poll immediately (CP)
               or return → wait for interrupt (INT)
Source Location: the mode transition logic is split across two files, the polling path in block/blk-mq.c (CP↔PAS→OL→INT) and the submission path in block/fops.c (INT→OL re-evaluation).
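The userspace end of this path can be exercised with a single fio command. The sketch below only builds and prints such a command. The pvsync2 engine and the hipri option come from the diagram above, while the job name, block size, and runtime are assumptions rather than the values bench.sh uses. fio_cmd is a hypothetical helper.

```shell
# Hypothetical helper: print a fio command line for one run.
# usage: fio_cmd NUMJOBS RW HIPRI(1|0)
fio_cmd() {
    hipri_flag=""
    if [ "$3" = "1" ]; then
        hipri_flag=" --hipri"      # sets RWF_HIPRI on each preadv2/pwritev2
    fi
    echo "fio --name=lab --filename=/dev/nvme0n1 --direct=1" \
         "--ioengine=pvsync2${hipri_flag} --rw=$2 --bs=4k" \
         "--numjobs=$1 --runtime=10 --time_based --group_reporting"
}

fio_cmd 1 randread 1   # polled read path (CP/LHP/PAS/DPAS)
fio_cmd 1 randread 0   # interrupt path (INT)
```

Note that running the printed command with --rw=randwrite overwrites data on the target device, so it belongs only on the lab VM's scratch NVMe disk.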

sysfs Interface

The DPAS kernel provides the following parameters under /sys/block/nvme0n1/queue/.

Parameter | Description | Default
switch_enabled | Enable DPAS mode switching | 0
switch_stat | Per-CPU mode statistics output (read-only) |
pas_enabled | Enable PAS mode | 0
pas_adaptive_enabled | Enable adaptive sleep | 0
switch_param1 | PAS→OL threshold (tf > param1). Paper setting: 0 | 0
switch_param2 | OL→PAS threshold (avg QD ≤ param2) | 10
switch_param3 | OL→INT threshold (avg QD > param3) | 10
switch_param4 | Enable PAS→CP transition (0/1) | 1
pas_d_init | Adaptive sleep initial/minimum value (ns). bench.sh auto-sets to avg_lat/10 | 100
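Before and after a benchmark run it is useful to see the current value of every tunable at once. The sketch below reads the parameters named in the table from a queue directory and skips any file that is missing, so it is harmless on a kernel without the DPAS patch. dump_queue_params is a hypothetical helper.

```shell
# Hypothetical helper: print name=value for each DPAS tunable found in a
# queue directory (for example /sys/block/nvme0n1/queue).
dump_queue_params() {
    for p in switch_enabled pas_enabled pas_adaptive_enabled \
             switch_param1 switch_param2 switch_param3 switch_param4 \
             pas_d_init; do
        if [ -r "$1/$p" ]; then
            printf '%s=%s\n' "$p" "$(cat "$1/$p")"
        fi
    done
}

# Run inside the VM on the DPAS kernel
if [ -d /sys/block/nvme0n1/queue ]; then
    dump_queue_params /sys/block/nvme0n1/queue
fi
```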
QD 10x Multiplier: QD-related parameters use a 10x scale, so param2=10 means average QD ≤ 1.0. This design provides one decimal digit of precision using only integer arithmetic.
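The effect of the 10x scale is easy to see with shell integer arithmetic, which truncates just like the kernel's integer division. In this worked example, avg_qd_x10 is a hypothetical helper that mirrors the expression qd_sum * 10 / count used in the kernel code.

```shell
# Hypothetical helper: scaled average queue depth, as in qd_sum * 10 / count.
avg_qd_x10() {  # usage: avg_qd_x10 QD_SUM IO_COUNT
    echo $(( $1 * 10 / $2 ))
}

avg_qd_x10 100 100   # prints 10 -> average QD 1.0
avg_qd_x10 150 100   # prints 15 -> average QD 1.5
echo $(( 150 / 100 ))  # prints 1  -> without the scale, 1.5 is truncated to 1
```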
Mode Transition Diagram: see the visual diagram of DPAS's 4 modes (PAS/CP/OL/INT) and their transition conditions on the DPAS Introduction page.

3-1. Core Control Logic (block/blk-mq.c)

The code below is from blk_mq_poll_hybrid(), the DPAS mode transition logic. See how each sysfs parameter controls the transitions.

① PAS Evaluation (every 100 IOs)

/* Evaluate after collecting 100 IOs in PAS mode */
if(sc->mode == _PAS && sc->pas_cnt >= 100) {
    average_qd = sc->qd_sum * 10 / sc->pas_cnt;  /* 10x precision */

    if(sc->param4 >= 1 && average_qd == 10) {  /* ← switch_param4: enable PAS→CP */
        sc->mode = _CP;          /* QD=1.0 → polling is better */

    } else if(sc->tf > sc->param1) {            /* ← switch_param1: PAS→OL threshold */
        sc->mode = _OL;          /* timer failure → overloaded */

    } else {                                  /* stay in PAS */
        sc->pas_cnt = 0;
        sc->qd_sum = 0;
        sc->tf = 0;              /* tf resets every evaluation */
    }
}

② OL Evaluation (every 100 IOs)

/* Evaluate after collecting 100 IOs in OL mode */
if(sc->mode == _OL && sc->ol_cnt >= 100) {
    average_qd = sc->qd_sum * 10 / sc->ol_cnt;

    if(average_qd <= sc->param2) {             /* ← switch_param2: OL→PAS recovery */
        sc->mode = _PAS;         /* load decreased → retry sleep */

    } else if(average_qd > sc->param3) {       /* ← switch_param3: OL→INT escalation */
        sc->mode = _INT;         /* high load → interrupt mode */

    } else {                                  /* stay in OL */
        sc->ol_cnt = 0;
        sc->qd_sum = 0;
    }
}
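The two evaluation blocks (① and ②) can be replayed in userspace to predict which mode follows a given measurement. The sketch below is a simplified model, not kernel code: next_mode is a hypothetical helper, the PARAM variables stand in for switch_param1 to switch_param4 with the defaults from the sysfs table, and counter resets are omitted.

```shell
# Defaults from the sysfs table (switch_param1..4)
PARAM1=0; PARAM2=10; PARAM3=10; PARAM4=1

# Hypothetical helper: decide the mode after one 100-IO evaluation.
# usage: next_mode MODE AVG_QD_X10 TF
next_mode() {
    case "$1" in
        PAS)
            if [ "$PARAM4" -ge 1 ] && [ "$2" -eq 10 ]; then echo CP   # QD = 1.0
            elif [ "$3" -gt "$PARAM1" ]; then echo OL                 # timer failures
            else echo PAS; fi ;;
        OL)
            if [ "$2" -le "$PARAM2" ]; then echo PAS                  # load decreased
            elif [ "$2" -gt "$PARAM3" ]; then echo INT                # high load
            else echo OL; fi ;;
        *)  echo "$1" ;;   # CP and INT are not evaluated in these two blocks
    esac
}

next_mode PAS 10 0   # single job, QD 1.0
next_mode PAS 20 3   # QD 2.0 with timer failures
next_mode OL 20 0    # still QD 2.0 while overloaded
```

With the defaults param2 = param3 = 10, every OL evaluation leaves OL: to PAS when the average QD is at most 1.0, and to INT otherwise.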

③ Adaptive Sleep & Timer Failure

/* sr_pnlt: previous IO was oversleep(0) / undersleep(1)
   sr_last:  current IO was oversleep(0) / undersleep(1)
   case: sr_pnlt*2 + sr_last → 0,1,2,3 */
cur_case = stat.sr_pnlt * 2 + stat.sr_last;
switch(cur_case) {
    case 0: adj -= dn;      break;             /* over,over → decrease sleep */
    case 1: adj = div + up; break;             /* over,under → slightly increase */
    case 2: adj = div - dn; break;             /* under,over → slightly decrease */
    case 3: adj += up;      break;             /* under,under → increase sleep */
}

stat.dur = stat.dur * adj / div;              /* update sleep duration */

if(stat.dur < q->d_init) {                   /* ← pas_d_init: minimum sleep */
    stat.dur = q->d_init;                     /* floor clamp */
    sc->tf++;                                 /* timer failure incremented! */
}                                             /* → accumulated tf triggers PAS→OL */
Key Flow: adaptive sleep tunes dur. Under high load, dur keeps shrinking until it hits the d_init floor, and each clamp increments tf. Once tf > param1, the mode switches PAS→OL, and if QD remains high, OL→INT hands completion over to interrupts. Conversely, when QD=1, PAS→CP switches to polling for the lowest latency.
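The floor clamp and the resulting timer-failure count can be traced with a few lines of integer arithmetic. This is a sketch under stated assumptions: next_dur is a hypothetical helper, and adj=90 with div=100 (a 10% decrease per step) are made-up values, since the real up, dn, and div constants are not shown in the excerpt.

```shell
# Hypothetical helper mirroring "dur = dur * adj / div" plus the d_init clamp.
# Prints "<new_dur> <tf_increment>".
# usage: next_dur DUR ADJ DIV D_INIT
next_dur() {
    d=$(( $1 * $2 / $3 ))
    if [ "$d" -lt "$4" ]; then
        echo "$4 1"     # clamped to the floor: counts as a timer failure
    else
        echo "$d 0"
    fi
}

# Shrink a 150 ns sleep repeatedly against a 100 ns floor
dur=150; tf=0
for step in 1 2 3 4 5 6; do
    result=$(next_dur "$dur" 90 100 100)
    dur=${result% *}
    tf=$(( tf + ${result#* } ))
    echo "step $step: dur=$dur tf=$tf"
done
```

Once dur reaches the floor, every further decrease adds to tf, which is what pushes the PAS evaluation toward OL under sustained load.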

Part 4: Building the Kernel from Source (Challenge)

Instead of using deb packages, apply the DPAS patch directly to the kernel source and build it yourself.

4-1. Install Build Dependencies

sudo apt install -y build-essential libncurses-dev bison flex libssl-dev \
    libelf-dev bc pahole dwarves zstd

4-2. Download Kernel Source

cd ~
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.18.tar.xz
tar xf linux-5.18.tar.xz
cd linux-5.18

4-3. Apply the DPAS Patch

# After transferring the patch from host to VM
patch -p1 < ~/dpas.patch
Makefile Hunk Failure: because the patch is based on 5.18-rc6, one Makefile hunk will fail. This can be safely ignored.
Instead, set localversion manually: echo "-dpas" > localversion
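patch saves every rejected hunk in a .rej file next to its target, so listing those files is a quick way to confirm that the Makefile was the only failure. The sketch below uses a hypothetical helper name, list_rejects.

```shell
# Hypothetical helper: list the reject files left behind by patch.
list_rejects() {  # usage: list_rejects SOURCE_DIR
    find "$1" -name '*.rej' -type f | sort
}

# Run inside the VM after applying the patch
if [ -d "$HOME/linux-5.18" ]; then
    list_rejects "$HOME/linux-5.18"
fi
```

If anything other than Makefile.rej is listed, the patch did not apply cleanly and the source tree should be re-extracted before trying again.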

4-4. Kernel Configuration (Minimized)

cp /boot/config-$(uname -r) .config
yes '' | make localmodconfig
Why localmodconfig? Building with the full distribution config copied from /boot compiles thousands of modules, which can exhaust the disk (No space left on device). localmodconfig keeps only the modules currently loaded in the VM, drastically reducing build time and disk usage.

4-5. Build

make -j$(nproc) bindeb-pkg

Upon a successful build, the linux-image-*.deb and linux-headers-*.deb files are generated in the parent directory of the source tree, which here is the home directory.

4-6. Install

sudo dpkg -i ~/linux-image-5.18.0-dpas_*.deb ~/linux-headers-5.18.0-dpas_*.deb
sudo reboot

Troubleshooting

Symptom | Cause | Solution
No space left on device | Build artifacts exceed disk space | make clean, delete tarball, use localmodconfig
Makefile Hunk #1 FAILED | Patch version mismatch | Ignore, use localversion instead
Kernel unchanged after reboot | GRUB default boot kernel | Select DPAS kernel with grub-reboot
Multiple hunk failures | Patch re-applied to already-patched source | Re-extract the source and try again
15GB+ required for build | Insufficient free disk space | Check df -h /, extend LVM or clean up files
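The disk-space rows above can be checked before starting a build. The sketch below reads the free space on the root filesystem with POSIX df and compares it against the 15 GB figure from the table. avail_gb is a hypothetical helper name.

```shell
# Hypothetical helper: free space of the filesystem holding PATH, in whole GiB.
avail_gb() {  # usage: avail_gb PATH
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1048576) }'   # KiB -> GiB
}

need=15
have=$(avail_gb /)
if [ "$have" -lt "$need" ]; then
    echo "only ${have} GB free on /, the build needs about ${need} GB"
else
    echo "${have} GB free on /, enough for the build"
fi
```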

Reference: Limitations of the QEMU Environment

Item | QEMU | Real Hardware
CP vs INT comparison | Valid (1.7-1.8x IOPS difference) | Valid
PAS adaptive sleep | Inaccurate timers, tf noise | Accurate
DPAS mode transition | Works, but absolute performance is for reference only | Works with default parameters
Multi-core scaling | Distorted by emulation overhead | Near-linear
WBT (Writeback Throttling) | Unsupported → QUEUE_FLAG_STATS inactive | Active (wbt_lat_usec=2000)
Key Takeaway: focus on the relative differences between modes and on the transition behavior rather than on absolute performance numbers. The goal is to understand the principles behind I/O completion mechanisms.

View dpas.patch improvement process: describes the issues found in QEMU, their fixes, and a case study on using AI tools (Claude Code) for kernel debugging. The AI proposed correct fixes but gave an inaccurate explanation of the root cause, which the researcher identified and verified. Consider how to properly leverage AI tools in systems research.