~/posts/2025-12-11_how-iouring-improves-database-performance.md
$

cat 2025-12-11_how-iouring-improves-database-performance.md


What is io_uring?

io_uring is a high-performance asynchronous I/O (input/output) interface introduced in Linux kernel 5.1.

Its primary goal is to overcome the performance bottlenecks of earlier I/O interfaces (such as synchronous read()/write() and the original Linux AIO) by drastically reducing the overhead of frequent system calls and memory copying.

How it Works

The core of io_uring is built around two lockless ring buffers mapped into memory shared between the application and the kernel: the Submission Queue (SQ), where the application writes I/O requests as Submission Queue Entries (SQEs), and the Completion Queue (CQ), where the kernel asynchronously posts the results of finished operations as Completion Queue Entries (CQEs). This design lets an application queue many requests in a single batch, amortizing the cost of one system call (io_uring_enter) over many operations.
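Here is a minimal sketch of that flow using liburing, the userspace helper library that wraps the raw ring setup; the file name, queue depth, and block size are illustrative:

```c
// Batch several reads and submit them with one io_uring_submit() call,
// i.e. a single io_uring_enter() system call for the whole batch.
// Build with: gcc batch_read.c -o batch_read -luring
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define QUEUE_DEPTH 8
#define BLOCK_SIZE  4096

int main(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("datafile.bin", O_RDONLY);   /* illustrative file name */
    if (fd < 0) { perror("open"); return 1; }

    static char bufs[QUEUE_DEPTH][BLOCK_SIZE];

    /* Fill the submission queue with read requests (SQEs). */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], BLOCK_SIZE, (off_t)i * BLOCK_SIZE);
        io_uring_sqe_set_data(sqe, (void *)(long)i);   /* tag to match CQEs later */
    }

    /* One system call submits the whole batch. */
    io_uring_submit(&ring);

    /* Reap one completion (CQE) per submitted request. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("request %ld completed, result %d\n",
               (long)io_uring_cqe_get_data(cqe), cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

The eight reads cost one submission system call instead of eight separate read() calls, and completions are read straight out of the shared CQ ring.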

Key Benefits

The system achieves its speed through features that minimize system calls and enable zero-copy I/O. Because the rings live in shared memory, applications can queue requests and reap completions with far fewer transitions between user space and kernel space (or none at all in polling mode). It also supports Registered Buffers, which let the storage hardware use Direct Memory Access (DMA) to place data directly into the application's memory, eliminating costly data copies through the kernel. Finally, it offers a unified API for a wide range of operations, including file I/O, network I/O (sockets), and various control operations.
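As a concrete taste of that unified API, here is a sketch of a database-style commit path that queues a page write and an fdatasync on the same ring, assuming liburing; the file name and page size are illustrative, and the IOSQE_IO_LINK flag (a feature beyond what the text above covers) is used only to keep the sync ordered after the write:

```c
// Two different operation types on one ring: a page write followed by an
// fdatasync, chained with IOSQE_IO_LINK so the sync runs only after the write.
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(4, &ring, 0) < 0) { perror("queue_init"); return 1; }

    int fd = open("wal.log", O_WRONLY | O_CREAT, 0644);   /* illustrative */
    if (fd < 0) { perror("open"); return 1; }

    static char page[4096] = "commit record...";

    /* SQE 1: write the page. IOSQE_IO_LINK orders the next SQE after this one. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, page, sizeof(page), 0);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    /* SQE 2: make it durable. Same ring, different operation type. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

    io_uring_submit(&ring);   /* both operations, one system call */

    for (int i = 0; i < 2; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("op %d -> res %d\n", i, cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```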

In essence, io_uring gives high-throughput applications (like databases and web servers) a complete, modern, and highly efficient way to extract the full speed of modern hardware. The key to its performance boost, especially for databases talking to lightning-fast storage like NVMe SSDs, is simple: it eliminates the major friction points created by the traditional Linux I/O stack.

Submitting Tasks (SQPoll)

Normally, when your database wants to start reading or writing data, it has to execute a system call. That's a performance penalty because the CPU has to jump from your application's user space into the operating system's kernel space just to hand over the job.

io_uring avoids this with SQPoll (Submission Queue Polling). A dedicated helper thread runs inside the kernel and does nothing but poll the submission ring buffer. Your database simply drops its I/O request into this shared memory queue; because the kernel thread is always watching, it picks the request up almost immediately, and your application rarely needs a system call just to start the I/O (only to wake the poller if it has gone idle).
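A sketch of how that looks in code, assuming liburing; depending on kernel version, SQPOLL may require elevated privileges (CAP_SYS_NICE) to set up, and the idle timeout and queue depth here are illustrative:

```c
// Set up a ring whose submission queue is drained by a kernel-side polling thread.
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int setup_sqpoll_ring(struct io_uring *ring)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    params.flags = IORING_SETUP_SQPOLL;   /* kernel thread polls the SQ for us */
    params.sq_thread_idle = 2000;         /* poller sleeps after 2000 ms with no work */

    int ret = io_uring_queue_init_params(64, ring, &params);
    if (ret < 0) {
        fprintf(stderr, "queue_init_params: %s\n", strerror(-ret));
        return ret;
    }

    /* From here on, io_uring_submit() usually just publishes SQEs to shared
     * memory; liburing only calls io_uring_enter() to wake the poller if it
     * has gone idle. */
    return 0;
}

int main(void)
{
    struct io_uring ring;
    if (setup_sqpoll_ring(&ring) == 0)
        io_uring_queue_exit(&ring);
    return 0;
}
```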

Completing Tasks (IOPoll)

When the NVMe drive finishes a data transfer, the standard way it communicates is by sending an interrupt to the CPU. The CPU has to stop what it's doing, save its state, handle the interrupt, and then resume. This interrupt overhead adds noticeable latency, particularly under high load.

With IOPoll (I/O Polling), this goes away. Instead of waiting for the NVMe device to interrupt the CPU, the kernel actively and continuously checks the hardware's completion queue, bypassing the interrupt mechanism entirely. Polling burns CPU cycles, so it is typically reserved for latency-sensitive setups where the application works close to the hardware (direct I/O on storage that supports polled completions), but there it is a huge win for cutting I/O completion latency.
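A sketch of an IOPoll setup, assuming liburing, a file opened with O_DIRECT on storage whose driver supports polled I/O, and an illustrative file name:

```c
// A ring set up for completion polling (IOPOLL). This requires O_DIRECT and a
// device/driver that supports polled I/O (e.g. NVMe with poll queues enabled).
#define _GNU_SOURCE            /* for O_DIRECT */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    /* O_DIRECT bypasses the page cache, which IOPOLL requires. */
    int fd = open("datafile.bin", O_RDONLY | O_DIRECT);   /* illustrative */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;   /* O_DIRECT alignment */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);
    io_uring_submit(&ring);

    /* With IOPOLL, waiting for the CQE busy-polls the device's completion
     * queue instead of sleeping until an interrupt arrives. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```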

Eliminating Data Copying (Registered Buffers and DMA)

This is the big game-changer for moving massive amounts of data. When your database reads a chunk of data through the traditional buffered path, the data is staged twice: the NVMe device first transfers it into the kernel's page cache, and the CPU then copies it again from the page cache into your database application's memory.

io_uring solves this with Registered Buffers and Direct Memory Access (DMA). Your application tells the kernel up front exactly which memory regions it will use for I/O. With that mapping in hand, the kernel can instruct the NVMe controller to use DMA and pump data directly from the drive into the application's pre-registered buffers. This eliminates the costly intermediate copy and the per-request overhead of pinning and mapping pages, resulting in maximum throughput.
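A sketch of the registered-buffer path, assuming liburing; the file name is illustrative, and O_DIRECT is used so the read actually bypasses the kernel's page cache:

```c
// Register a buffer up front, then issue a fixed-buffer read that refers to it
// by index, letting the kernel skip per-request pinning and target it with DMA.
#define _GNU_SOURCE            /* for O_DIRECT */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define BUF_SIZE 4096

int main(void)
{
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) { perror("queue_init"); return 1; }

    /* Allocate an aligned buffer and describe it to the kernel once, up front. */
    void *buf;
    if (posix_memalign(&buf, 4096, BUF_SIZE) != 0) return 1;

    struct iovec iov = { .iov_base = buf, .iov_len = BUF_SIZE };
    int ret = io_uring_register_buffers(&ring, &iov, 1);
    if (ret < 0) {
        fprintf(stderr, "register_buffers: %s\n", strerror(-ret));
        return 1;
    }

    int fd = open("datafile.bin", O_RDONLY | O_DIRECT);   /* illustrative */
    if (fd < 0) { perror("open"); return 1; }

    /* read_fixed names the buffer by its registered index (0 here). */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read_fixed(sqe, fd, buf, BUF_SIZE, 0, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read_fixed returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    free(buf);
    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```

In a real database, the registered buffers would typically be the buffer pool's own page frames, registered once at startup and reused for the lifetime of the process.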

References

  • https://arxiv.org/html/2512.04859v1