From Interrupts to Polling, and Adaptive Techniques
Background Reading

When an application reads from or writes to a disk, the mechanism by which it is notified once a submitted request finishes is what we call I/O completion.
In the HDD era, a single disk I/O took several milliseconds (ms), so differences of a few microseconds (us) due to the completion notification method were negligible. However, modern NVMe SSDs process a single I/O in 10 to 100 microseconds (us).
| Device | I/O latency | Completion overhead ratio |
|---|---|---|
| HDD | ~5,000 us | <0.1% (negligible) |
| SATA SSD | ~100 us | ~5% |
| NVMe SSD (TLC) | ~15 us | ~30% |
| Intel Optane (3D XPoint) | ~10 us | ~50% |
As devices get faster, the proportion of software overhead in the total I/O time increases dramatically. Optimizing the I/O completion method is now synonymous with optimizing overall I/O performance.
Interrupt (the traditional method)
- After submitting an I/O request, the CPU is released and the thread sleeps.
- The device wakes it up via an interrupt upon completion.
- CPU efficient, but the wake-up cost is high.

Polling
- After submitting an I/O request, the CPU continuously checks for completion (busy-wait).
- Can detect completion immediately: the lowest latency.
- Costs 100% CPU utilization while waiting.

Hybrid polling
- Sleeps for a period, then starts polling.
- Dynamically switches modes based on conditions.
- A balance between latency and CPU usage.
- Introduction of the poll_queues parameter in the NVMe driver: dedicated hardware queues separated out for polling.
- Addition of the blk_poll() API to the block layer; the io_poll sysfs interface appeared.
- Hybrid polling introduced (io_poll_delay): sleep for a set duration before starting to poll.
- io_uring appeared (Linux 5.0), the new standard for asynchronous I/O, with polling support via the IORING_SETUP_IOPOLL flag.
- The last period (through 5.18) when preadv2(RWF_HIPRI)-based synchronous Direct I/O polling worked reliably.
- Synchronous DIO polling removed in 5.19: polling transitioned to io_uring only, and the sync polling path via RWF_HIPRI was deleted.
- io_uring established as the primary interface for polling. HPE (Hewlett Packard Enterprise) adopted an io_uring-based storage stack in its products, increasing industry interest.
The synchronous polling path (preadv2 + RWF_HIPRI) had few users relative to the kernel maintenance burden.
As io_uring provided a more efficient polling interface, the redundant path was removed as part of cleanup.
However, this decision did not mean "polling is unnecessary" — rather, it meant "the polling interface was consolidated into io_uring."
At USENIX FAST '26 (File and Storage Technologies conference), held in February 2026, multiple papers on I/O completion optimization were presented, demonstrating that research in this area is once again thriving.
"DPAS: A Prompt, Accurate and Safe I/O Completion Method for SSDs"
"UnICom: A Universally High-Performant I/O Completion Mechanism for Modern Computer Systems"
| Aspect | DPAS | UnICom |
|---|---|---|
| Mode decision signal | I/O queue depth (qd) — direct counting of I/O activity | Run queue task count — indirect estimation of CPU contention |
| Polling entity | The thread that issued the I/O itself | Dedicated completion thread (on another core) |
| Kernel modification scope | Block layer + NVMe driver | Scheduler + separate kernel module |
| Single I/O + multiple C-threads | qd=1 → CP (optimal) | task>1 → TagSched-TagPoll (unnecessary overhead) |
Mode-switching behavior can be observed through a switch_stat sysfs interface.
io_uring, introduced in Linux 5.0, has now established itself as a core Linux I/O interface.
io_uring is an asynchronous I/O framework that places shared ring buffers (Submission Queue, Completion Queue) between the kernel and userspace, allowing I/O requests to be submitted and completions to be checked without system calls.
Traditional: App → syscall → kernel → device → interrupt → kernel → syscall return → App
io_uring: App → write to SQ → kernel processes → write to CQ → App checks CQ
(minimized syscalls, batch processing possible)
io_uring's IORING_SETUP_IOPOLL mode checks completion via polling.
After synchronous DIO polling was removed in 5.19, io_uring became the only official interface for polling.
If adaptive completion techniques like DPAS are integrated into io_uring's polling path, it would become possible at the kernel level to automatically select the optimal completion mode based on the workload. This is being discussed as a promising path toward kernel mainline adoption.
The reasons for using kernel 5.18 in this lab are as follows:
preadv2(RWF_HIPRI) works. Starting from 5.19, this path was removed.

Guest (kernel 5.18)                  Host (kernel 5.19+, macOS, Windows...)
───────────────── ──────────────────────────────────────
App: preadv2(RWF_HIPRI)
│
Guest NVMe driver: sets REQ_POLLED
│
Guest: poll CQ (busy-wait) ◄── completed within guest
│ ▲
Virtual NVMe CQ ◄──── QEMU writes CQ entry ◄── Host I/O completed
- Documentation/block/queue-sysfs.rst (io_poll, io_poll_delay)
- io_uring(7) man page; Jens Axboe, "Efficient IO with io_uring"
- drivers/nvme/host/pci.c (nvme module parameter)