From Interrupts to Polling, and Adaptive Techniques
Background Reading

When an application reads from or writes to a disk, the mechanism by which it is notified once a submitted request finishes is what we call I/O completion.
In the HDD era, a single disk I/O took several milliseconds (ms), so differences of a few microseconds (us) due to the completion notification method were negligible. However, modern NVMe SSDs process a single I/O in 10 to 100 microseconds (us).
| Device | I/O latency | Completion overhead ratio |
|---|---|---|
| HDD | ~5,000 us | <0.1% (negligible) |
| SATA SSD | ~100 us | ~5% |
| NVMe SSD (TLC) | ~15 us | ~30% |
| Intel Optane (3D XPoint) | ~10 us | ~50% |
As devices get faster, the proportion of software overhead in the total I/O time increases dramatically. Optimizing the I/O completion method is now synonymous with optimizing overall I/O performance.
Interrupt (the traditional method)
- After submitting an I/O request, the CPU is released and the thread sleeps.
- The device wakes it up via an interrupt upon completion.
- CPU efficient, but the wake-up cost is high.

Polling
- After submitting an I/O request, the CPU continuously checks for completion (busy-wait).
- Can detect completion immediately: the lowest latency.
- Costs 100% CPU utilization while waiting.

Hybrid polling
- Sleeps for a period, then starts polling.
- Dynamically switches modes based on conditions.
- A balance between latency and CPU usage.
- Introduction of the poll_queues parameter in the NVMe driver: dedicated hardware queues separated out for polling.
- Addition of the blk_poll() API to the block layer; the io_poll sysfs interface appeared.
- Hybrid polling introduced (io_poll_delay): sleep for a set duration before starting to poll.
- io_uring appeared (Linux 5.0), the new standard for asynchronous I/O, with polling support via the IORING_SETUP_IOPOLL flag.
- The last period (through 5.18) when preadv2(RWF_HIPRI)-based synchronous Direct I/O polling worked reliably.
- Synchronous DIO polling removed in 5.19: polling transitioned to io_uring only, and the sync polling path via RWF_HIPRI was deleted.
- io_uring established as the primary interface for polling. HPE (Hewlett Packard Enterprise) adopted an io_uring-based storage stack in its products, increasing industry interest.
The synchronous polling path (preadv2 + RWF_HIPRI) had few users relative to the kernel maintenance burden.
As io_uring provided a more efficient polling interface, the redundant path was removed as part of cleanup.
However, this decision did not mean "polling is unnecessary" — rather, it meant "the polling interface was consolidated into io_uring."
At USENIX FAST '26 (File and Storage Technologies conference), held in February 2026, multiple papers on I/O completion optimization were presented, demonstrating that research in this area is once again thriving.
"DPAS: A Prompt, Accurate and Safe I/O Completion Method for SSDs"
"UnICom: A Universally High-Performant I/O Completion Mechanism for Modern Computer Systems"
| Aspect | DPAS | UnICom |
|---|---|---|
| Mode decision signal | I/O queue depth (qd) — direct counting of I/O activity | Run queue task count — indirect estimation of CPU contention |
| Polling entity | The thread that issued the I/O itself | Dedicated completion thread (on another core) |
| Kernel modification scope | Block layer + NVMe driver | Scheduler + separate kernel module |
| Single I/O + multiple C-threads | qd=1 → CP (optimal) | task>1 → TagSched-TagPoll (unnecessary overhead) |
Mode-switching behavior can be observed through a switch_stat sysfs interface.
io_uring, introduced in Linux 5.0, has now established itself as a core Linux I/O interface.
io_uring is an asynchronous I/O framework that places shared ring buffers (Submission Queue, Completion Queue) between the kernel and userspace, allowing I/O requests to be submitted and completions to be checked without system calls.
Traditional: App → syscall → kernel → device → interrupt → kernel → syscall return → App
io_uring: App → write to SQ → kernel processes → write to CQ → App checks CQ
(minimized syscalls, batch processing possible)
io_uring's IORING_SETUP_IOPOLL mode checks completion via polling.
After synchronous DIO polling was removed in 5.19, io_uring became the only official interface for polling.
If adaptive completion techniques like DPAS are integrated into io_uring's polling path, it would become possible at the kernel level to automatically select the optimal completion mode based on the workload. This is being discussed as a promising path toward kernel mainline adoption.
The reasons for using kernel 5.18 in this lab are as follows:
preadv2(RWF_HIPRI) works. Starting from 5.19, this path was removed.

Guest (kernel 5.18)                  Host (kernel 5.19+, macOS, Windows...)
───────────────── ──────────────────────────────────────
App: preadv2(RWF_HIPRI)
│
Guest NVMe driver: sets REQ_POLLED
│
Guest: poll CQ (busy-wait) ◄── completed within guest
│ ▲
Virtual NVMe CQ ◄──── QEMU writes CQ entry ◄── Host I/O completed
- Documentation/block/queue-sysfs.rst (io_poll, io_poll_delay)
- io_uring(7) man page; Jens Axboe, "Efficient IO with io_uring"
- drivers/nvme/host/pci.c (nvme module parameter)