Hands-on Performance Comparison of NVMe I/O Completion Modes
2026-1 Systems Technology
You must have already completed the platform-specific QEMU installation and Ubuntu VM setup.
uname -r → 5.18.0-051800-generic)
preadv2(RWF_HIPRI).
This feature was removed starting from Linux kernel 5.19, so 5.18 is the last kernel that supports it.
Since the QEMU Guest VM runs kernel 5.18, the polling lab works regardless of the host OS kernel version.
Install the DPAS-patched kernel using the provided deb packages.
A .deb file is a software installation package used in Ubuntu/Debian-based Linux distributions.
It serves the same role as .msi installers on Windows or .pkg on macOS.
Install with sudo dpkg -i file.deb, which automatically places system files such as kernel images and headers in the correct locations.
Run uname -m inside the Guest VM to check the architecture.
| Host Environment | uname -m Output | Required Package |
|---|---|---|
| Apple Silicon Mac (M1/M2/M3/M4) | aarch64 | ARM64 |
| Intel Mac / Linux PC / Windows (WSL2) | x86_64 | x86_64 |
Download the package matching your VM's CPU architecture.
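The table above can be turned into a small check that runs inside the Guest VM. This is a minimal sketch: the function name pkg_for_arch is made up for illustration, and it only knows the two uname -m values listed in the table.

```shell
# Map `uname -m` output to the "Required Package" column of the table above.
# pkg_for_arch is a hypothetical helper, not part of the lab materials.
pkg_for_arch() {
  case "$1" in
    aarch64) echo "ARM64" ;;
    x86_64)  echo "x86_64" ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Usage inside the Guest VM:
pkg_for_arch "$(uname -m)" || true
```

Check the Guest VM rather than the host, since the package must match the VM's CPU architecture.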
Advanced: Build the kernel yourself
Download dpas.patch
Transfer the downloaded deb files from the host to the VM.
# Run on the host terminal
scp -P 2222 linux-image-5.18.0-dpas_*.deb <username>@localhost:~/
scp -P 2222 linux-headers-5.18.0-dpas_*.deb <username>@localhost:~/
# Run inside the VM
sudo dpkg -i ~/linux-image-5.18.0-dpas_*.deb ~/linux-headers-5.18.0-dpas_*.deb
sudo reboot
uname -r
# 5.18.0-dpas
sudo grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 5.18.0-dpas"
sudo reboot
Use the unified benchmark script bench.sh. It selectively runs 5 I/O completion modes and measures IOPS, latency, CPU usage, and context switches for each mode across job counts (1,2,4,8,16).
| Mode | Description | io_poll_delay | pas_enabled | hipri |
|---|---|---|---|---|
| INT | Interrupt completion (no polling) | -1 | 0 | No |
| CP | Classic Polling (busy-poll) | -1 | 0 | Yes |
| LHP | Linux Hybrid Polling (adaptive sleep) | 0 | 0 | Yes |
| PAS | PAS (poll-after-sleep, adaptive) | 0 | 1 | Yes |
| DPAS | Dynamic PAS (auto mode switching) | 0 | 1 | Yes |
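As a quick reference, the per-mode settings in the table can be written as a lookup. This is a sketch only: mode_settings is a hypothetical helper that prints the values and does not write to sysfs. PAS and DPAS share the same three values here; what separates them is presumably switch_enabled, which this table does not list, so treat that as an assumption.

```shell
# Print the settings for one mode, using only the values from the table above.
# mode_settings is a hypothetical helper; it prints and changes nothing.
mode_settings() {
  case "$1" in
    INT)      echo "io_poll_delay=-1 pas_enabled=0 hipri=no" ;;
    CP)       echo "io_poll_delay=-1 pas_enabled=0 hipri=yes" ;;
    LHP)      echo "io_poll_delay=0 pas_enabled=0 hipri=yes" ;;
    PAS|DPAS) echo "io_poll_delay=0 pas_enabled=1 hipri=yes" ;;
    *)        echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

for m in INT CP LHP PAS DPAS; do
  printf '%-4s %s\n' "$m" "$(mode_settings "$m")"
done
```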
taskset -c 0)
modprobe nvme reload between each mode/job
poll_queues=jobs (capped at nproc)
param1=0, param2=10, param3=10, param4=1
# Transfer to VM
scp -P 2222 bench.sh <username>@localhost:~/
# Run inside VM
chmod +x bench.sh
# Syntax: sudo bash bench.sh [modes...] [R|W] [repeat_count]
sudo bash bench.sh ALL # All modes, randread, 1 run
sudo bash bench.sh INT CP DPAS W 3 # 3 modes, randwrite, 3 runs
sudo bash bench.sh DPAS 2 # DPAS only, randread, 2 runs
sudo bash bench.sh ALL W 2 # All modes, randwrite, 2 runs
# Arguments can be in any order:
sudo bash bench.sh INT 2 R CP # same as: CP INT R 2
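Order-independent arguments work because each token can be classified by its shape alone: a mode name, R or W, or a number. The sketch below illustrates that idea. It is not bench.sh's actual parser, parse_bench_args is a hypothetical name, and the defaults (randread, 1 run) are taken from the usage comments above.

```shell
# Hypothetical sketch of order-independent parsing; bench.sh itself may differ.
parse_bench_args() {
  local sel=" " rw="R" runs=1 arg m modes=""
  for arg in "$@"; do
    case "$arg" in
      ALL)                 sel=" INT CP LHP PAS DPAS " ;;
      INT|CP|LHP|PAS|DPAS) sel="$sel$arg " ;;
      R|W)                 rw="$arg" ;;
      *[!0-9]*|"")         echo "unknown argument: $arg" >&2; return 1 ;;
      *)                   runs="$arg" ;;   # digits only: repeat count
    esac
  done
  # Emit the selected modes in a fixed order so argument order does not matter.
  for m in INT CP LHP PAS DPAS; do
    case "$sel" in *" $m "*) modes="${modes:+$modes }$m" ;; esac
  done
  echo "modes=$modes rw=$rw runs=$runs"
}

parse_bench_args INT 2 R CP   # modes=INT CP rw=R runs=2
parse_bench_args CP INT R 2   # modes=INT CP rw=R runs=2
```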
=== INT ===
Jobs IOPS lat_us cpu% ctx
---- ------ ------ ----- ------
1 7253 137.3 0.2 145087
2 16294 122.2 0.8 250851
...
=== DPAS ===
Jobs IOPS lat_us cpu% ctx breakdown
---- ------ ------ ----- ------ ----------
1 18694 53.1 94.7 18058 cp:170035 pas:17100 ol:0 int:0
2 24119 82.1 2.3 398800 cp:1000 pas:800 ol:2801 int:236614
...
Record the IOPS from the script output.
| Mode | j=1 | j=2 | j=4 | j=8 | j=16 |
|---|---|---|---|---|---|
| INT | | | | | |
| CP | | | | | |
| LHP | | | | | |
| PAS | | | | | |
| DPAS | | | | | |
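To avoid copying numbers by hand, the IOPS column can be pulled out of saved script output. The sketch below assumes the output has the layout shown in the sample above ("=== MODE ===" headers followed by rows starting with the job count). iops_rows is a hypothetical helper; if a mode appears more than once, only its last run is kept.

```shell
# Read bench.sh-style output on stdin and print one table row per mode.
# Assumes the sample layout shown earlier; iops_rows is a hypothetical helper.
iops_rows() {
  awk '
    /^=== / {                       # mode header, e.g. "=== INT ==="
      mode = $2
      if (!(mode in row)) order[++n] = mode
      row[mode] = ""                # a repeated mode keeps only its last run
      next
    }
    mode != "" && $1 ~ /^[0-9]+$/ { # data row: Jobs IOPS lat_us ...
      row[mode] = row[mode] " " $2 " |"
    }
    END { for (i = 1; i <= n; i++) print "| " order[i] " |" row[order[i]] }
  '
}

# Demo using two rows from the sample output:
printf '%s\n' '=== INT ===' '1 7253 137.3 0.2 145087' '2 16294 122.2 0.8 250851' | iops_rows
# prints: | INT | 7253 | 16294 |
```

In practice the script output would first be saved, for example with tee, and then piped into iops_rows.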
Check the DPAS breakdown in the bench.sh output:
| MODE Value | Meaning |
|---|---|
| 0 | INT (Interrupt) |
| 1 | CP (Classic Polling) |
| 2 | PAS (Initial State) |
| 3 | OL (Overloaded) |
| Jobs | Expected Final Mode | Reason |
|---|---|---|
| 1 | CP (MODE 1) | QD=1 → PAS→CP transition, lowest latency via polling |
| 2+ | INT (MODE 0) | QD>1 → PAS→OL→INT transition, yields CPU via interrupts |
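One way to compare a run against the expected final mode is to see which counter dominates the breakdown column. The sketch below assumes the breakdown has the "name:count" form from the sample output. dominant_mode is a hypothetical helper, and the largest counter is only a proxy for the final mode, not the MODE value itself.

```shell
# Print the name of the largest counter in a breakdown string.
# dominant_mode is a hypothetical helper; on ties the first counter wins.
dominant_mode() {
  local best="" best_n=-1 pair name n
  for pair in $1; do          # intentionally unquoted: split on spaces
    name=${pair%%:*}
    n=${pair##*:}
    if [ "$n" -gt "$best_n" ]; then
      best=$name
      best_n=$n
    fi
  done
  echo "$best"
}

dominant_mode "cp:170035 pas:17100 ol:0 int:0"       # prints cp
dominant_mode "cp:1000 pas:800 ol:2801 int:236614"   # prints int
```

The two calls use the j=1 and j=2 rows of the sample DPAS output, and the results match the CP and INT rows of the table above.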
pas io count always less than the polled io count in the DPAS breakdown?
d_init value affects DPAS mode transitions.
When fio submits IO, it traverses the kernel block layer to reach the NVMe driver. DPAS controls the completion waiting strategy in the polling path.
/* Userspace */
fio (pvsync2 --hipri)
└─ preadv2(RWF_HIPRI) /* polling request flag */
/* Kernel: IO submission (block/fops.c) */
└─ blk_mq_submit_bio() /* block layer entry */
├─ ★ INT→OL transition (int_cnt ≥ 10000) /* re-evaluate at submit */
└─ nvme_queue_rq() /* dispatch to NVMe driver */
/* Kernel: IO completion wait (block/blk-mq.c) */
└─ blk_mq_poll() /* polling entry point */
└─ blk_mq_poll_hybrid() /* ★ DPAS core: mode decision */
├─ blk_mq_poll_pas_nsecs() /* adaptive sleep calc (dur, tf) */
├─ mode transition logic /* CP↔PAS→OL→INT transitions */
└─ hrtimer_nanosleep() /* PAS: sleep then poll */
or poll immediately (CP)
or return → wait for interrupt (INT)
block/blk-mq.c polling path (CP↔PAS→OL→INT) and
block/fops.c submission path (INT→OL re-evaluation).
The DPAS kernel provides the following parameters under /sys/block/nvme0n1/queue/.
| Parameter | Description | Default |
|---|---|---|
| switch_enabled | Enable DPAS mode switching | 0 |
| switch_stat | Per-CPU mode statistics output (read-only) | — |
| pas_enabled | Enable PAS mode | 0 |
| pas_adaptive_enabled | Enable adaptive sleep | 0 |
| switch_param1 | PAS→OL threshold (tf > param1). Paper setting: 0 | 0 |
| switch_param2 | OL→PAS threshold (avg QD ≤ param2) | 10 |
| switch_param3 | OL→INT threshold (avg QD > param3) | 10 |
| switch_param4 | Enable PAS→CP transition (0/1) | 1 |
| pas_d_init | Adaptive sleep initial/minimum value (ns). bench.sh auto-sets to avg_lat/10 | 100 |
param2=10 means average QD ≤ 1.0.
This design provides decimal precision using integer arithmetic.
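The 10x fixed-point scheme is easy to try with shell integer arithmetic. This is a minimal sketch: avg_qd_x10 is a hypothetical helper that mirrors the qd_sum * 10 / count expression, and the input values are made up for illustration.

```shell
# Average queue depth scaled by 10, using integer division only.
# avg_qd_x10 is a hypothetical helper: $1 = sum of sampled QDs, $2 = IO count.
avg_qd_x10() {
  if [ "$2" -le 0 ]; then
    echo "IO count must be positive" >&2
    return 1
  fi
  echo $(( $1 * 10 / $2 ))
}

avg_qd_x10 100 100   # 10 -> average QD 1.0
avg_qd_x10 150 100   # 15 -> average QD 1.5
avg_qd_x10 104 100   # 10 -> 1.04 truncates to 1.0
```

Because integer division truncates, a scaled value of 10 covers every average from 1.0 up to, but not including, 1.1.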
The code below is from blk_mq_poll_hybrid(), the DPAS mode transition logic.
See how each sysfs parameter controls the transitions.
/* Evaluate after collecting 100 IOs in PAS mode */
if(sc->mode == _PAS && sc->pas_cnt >= 100) {
average_qd = sc->qd_sum * 10 / sc->pas_cnt; /* 10x precision */
if(sc->param4 >= 1 && average_qd == 10) { /* ← switch_param4: enable PAS→CP */
sc->mode = _CP; /* QD=1.0 → polling is better */
} else if(sc->tf > sc->param1) { /* ← switch_param1: PAS→OL threshold */
sc->mode = _OL; /* timer failure → overloaded */
} else { /* stay in PAS */
sc->pas_cnt = 0;
sc->qd_sum = 0;
sc->tf = 0; /* tf resets every evaluation */
}
}
/* Evaluate after collecting 100 IOs in OL mode */
if(sc->mode == _OL && sc->ol_cnt >= 100) {
average_qd = sc->qd_sum * 10 / sc->ol_cnt;
if(average_qd <= sc->param2) { /* ← switch_param2: OL→PAS recovery */
sc->mode = _PAS; /* load decreased → retry sleep */
} else if(average_qd > sc->param3) { /* ← switch_param3: OL→INT escalation */
sc->mode = _INT; /* high load → interrupt mode */
} else { /* stay in OL */
sc->ol_cnt = 0;
sc->qd_sum = 0;
}
}
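The two evaluation blocks above can be replayed in userspace to see which transition a given measurement triggers. The sketch below mirrors only the conditions shown in the excerpt. next_mode is a hypothetical helper, the parameter defaults come from the sysfs table, and CP and INT are passed through unchanged because the excerpt does not show how they are left.

```shell
# Replay the PAS and OL evaluations from the excerpt above.
# Args: mode, avg QD x10, tf, then optional param1..param4 (table defaults).
next_mode() {
  local mode=$1 avg=$2 tf=$3 p1=${4:-0} p2=${5:-10} p3=${6:-10} p4=${7:-1}
  case "$mode" in
    PAS)
      if [ "$p4" -ge 1 ] && [ "$avg" -eq 10 ]; then echo CP    # QD = 1.0
      elif [ "$tf" -gt "$p1" ]; then echo OL                   # timer failures
      else echo PAS
      fi ;;
    OL)
      if [ "$avg" -le "$p2" ]; then echo PAS                   # load dropped
      elif [ "$avg" -gt "$p3" ]; then echo INT                 # load stays high
      else echo OL
      fi ;;
    *) echo "$mode" ;;   # CP/INT exits are not part of the excerpt
  esac
}

next_mode PAS 10 0   # CP
next_mode PAS 20 3   # OL
next_mode OL 20 0    # INT
next_mode OL 10 0    # PAS
```

With the default param2 = param3 = 10, the OL evaluation always leaves OL: an average QD of 1.0 or less returns to PAS, and anything above goes to INT.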
/* sr_pnlt: previous IO was oversleep(0) / undersleep(1)
sr_last: current IO was oversleep(0) / undersleep(1)
case: sr_pnlt*2 + sr_last → 0,1,2,3 */
cur_case = stat.sr_pnlt * 2 + stat.sr_last;
switch(cur_case) {
case 0: adj -= dn; break; /* over,over → decrease sleep */
case 1: adj = div + up; break; /* over,under → slightly increase */
case 2: adj = div - dn; break; /* under,over → slightly decrease */
case 3: adj += up; break; /* under,under → increase sleep */
}
stat.dur = stat.dur * adj / div; /* update sleep duration */
if(stat.dur < q->d_init) { /* ← pas_d_init: minimum sleep */
stat.dur = q->d_init; /* floor clamp */
sc->tf++; /* timer failure incremented! */
} /* → accumulated tf triggers PAS→OL */
dur. Under high load, dur keeps decreasing until it hits
the d_init floor → tf++ → when tf > param1, PAS→OL →
if QD remains high, OL→INT switches to interrupt mode.
Conversely, when QD=1, PAS→CP switches to polling for lowest latency.
Instead of using deb packages, apply the DPAS patch directly to the kernel source and build it yourself.
sudo apt install -y build-essential libncurses-dev bison flex libssl-dev \
libelf-dev bc pahole dwarves zstd
cd ~
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.18.tar.xz
tar xf linux-5.18.tar.xz
cd linux-5.18
# After transferring the patch from host to VM
patch -p1 < ~/dpas.patch
echo "-dpas" > localversion
cp /boot/config-$(uname -r) .config
yes '' | make localmodconfig
No space left on device).
localmodconfig includes only currently loaded modules, drastically reducing build time and disk usage.
make -j$(nproc) bindeb-pkg
Upon successful build, linux-image-*.deb and linux-headers-*.deb files will be generated in the home directory.
sudo dpkg -i ~/linux-image-5.18.0-dpas_*.deb ~/linux-headers-5.18.0-dpas_*.deb
sudo reboot
| Symptom | Cause | Solution |
|---|---|---|
| No space left on device | Build artifacts exceed disk space | make clean, delete tarball, use localmodconfig |
| Makefile Hunk #1 FAILED | Patch version mismatch | Ignore, use localversion instead |
| Kernel unchanged after reboot | GRUB default boot kernel | Select DPAS kernel with grub-reboot |
| Multiple hunk failures | Patch re-applied to already-patched source | Re-extract the source and try again |
| 15GB+ required for build | Insufficient free disk space | Check df -h /, extend LVM or clean up files |
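The disk space row can be checked before starting a build. This is a minimal sketch: has_build_space is a hypothetical helper, the 15 GB threshold comes from the table above, and the df options assume GNU coreutils as shipped with Ubuntu.

```shell
# Succeed if the available space ($1, in GB) meets the requirement ($2, default 15).
# has_build_space is a hypothetical helper, not part of the lab materials.
has_build_space() {
  [ "$1" -ge "${2:-15}" ]
}

# Available space on / in whole GB (GNU df prints e.g. "20G"; keep the digits).
avail_gb=$(df -BG --output=avail / | tail -n 1 | tr -dc '0-9')

if has_build_space "$avail_gb"; then
  echo "OK: ${avail_gb}G free"
else
  echo "Low disk space: ${avail_gb}G free, 15G+ recommended"
fi
```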
| Item | QEMU | Real Hardware |
|---|---|---|
| CP vs INT comparison | Valid (1.7-1.8x IOPS difference) | Valid |
| PAS adaptive sleep | Inaccurate timers, tf noise | Accurate |
| DPAS mode transition | Works, but absolute performance is for reference only | Works with default parameters |
| Multi-core scaling | Distorted by emulation overhead | Near-linear |
| WBT (Write Back Throttling) | Unsupported → QUEUE_FLAG_STATS inactive | Active (wbt_lat_usec=2000) |