The Event Poll Interface

Recognizing the limitations of both poll( ) and select( ), the 2.6 Linux kernel^[15] introduced the event poll (epoll) facility. While more complex than the two earlier interfaces, epoll solves the fundamental performance problem shared by both of them, and adds several new features.

Both poll( ) and select( ) (discussed in Chapter 2) require the full list of file descriptors to watch on each invocation. The kernel must then walk that list, examining every file descriptor to be monitored. When the list grows large (it may contain hundreds or even thousands of file descriptors), walking it on each invocation becomes a scalability bottleneck.

Epoll circumvents this problem by decoupling the monitor registration from the actual monitoring. One system call initializes an epoll context, another adds monitored file descriptors to or removes them from the context, and a third performs the actual event wait.
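The three steps can be sketched in one short function. This is only an illustration under stated assumptions: wait_readable( ) is a hypothetical helper name, not part of the epoll interface, and it creates and destroys the context on every call purely to show all three calls side by side.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the epoll API): returns 1 if rfd
 * becomes readable within timeout_ms milliseconds, 0 on timeout,
 * and -1 on error.
 */
int wait_readable (int rfd, int timeout_ms)
{
        struct epoll_event ev, out;
        int epfd, n;

        /* Step 1: create the epoll context. */
        epfd = epoll_create (1);
        if (epfd < 0)
                return -1;

        /* Step 2: register the file descriptor with the context. */
        ev.events = EPOLLIN;
        ev.data.fd = rfd;
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, rfd, &ev) < 0) {
                close (epfd);
                return -1;
        }

        /* Step 3: wait for an event. */
        n = epoll_wait (epfd, &out, 1, timeout_ms);

        close (epfd);
        return n;
}
```

A real program would perform steps 1 and 2 once and then repeat only step 3, which is where the savings over poll( ) and select( ) come from.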

Creating a New Epoll Instance

An epoll context is created via epoll_create( ):

#include <sys/epoll.h>

int epoll_create (int size);

A successful call to epoll_create( ) instantiates a new epoll instance, and returns a file descriptor associated with the instance. This file descriptor has no relationship to a real file; it is just a handle to be used with subsequent calls using the epoll facility. The size parameter is a hint to the kernel about the number of file descriptors that are going to be monitored; it is not the maximum number. On the kernels that first shipped epoll, passing in a good approximation resulted in better performance, although the exact number was never required. Since Linux 2.6.8, the kernel ignores the value, but it must still be greater than zero. On error, the call returns −1, and sets errno to one of the following:

EINVAL: The size parameter is not a positive number.
ENFILE: The system has reached the limit on the total number of open files.
ENOMEM: Insufficient memory was available to complete the operation.

A typical call is:

int epfd;

epfd = epoll_create (100);  /* plan to watch ~100 fds */
if (epfd < 0)
        perror ("epoll_create");

The file descriptor returned from epoll_create( ) should be destroyed via a call to close( ) after polling is finished.

Controlling Epoll

The epoll_ctl( ) system call can be used to add file descriptors to and remove file descriptors from a given epoll context:

#include <sys/epoll.h>

int epoll_ctl (int epfd,
               int op,
               int fd,
               struct epoll_event *event);

The <sys/epoll.h> header defines the epoll_event structure as:

struct epoll_event {
        __u32 events;  /* events */
        union {
                void *ptr;
                int fd;
                __u32 u32;
                __u64 u64;
        } data;
};

A successful call to epoll_ctl( ) controls the epoll instance associated with the file descriptor epfd. The parameter op specifies the operation to be taken against the file associated with fd. The event parameter further describes the behavior of the operation.

Here are valid values for the op parameter:

EPOLL_CTL_ADD: Add a monitor on the file associated with the file descriptor fd to the epoll instance associated with epfd, per the events defined in event.
EPOLL_CTL_DEL: Remove a monitor on the file associated with the file descriptor fd from the epoll instance associated with epfd.
EPOLL_CTL_MOD: Modify an existing monitor of fd with the updated events specified by event.

The events field in the epoll_event structure lists which events to monitor on the given file descriptor. Multiple events can be bitwise-ORed together. Here are valid values:

EPOLLERR: An error condition occurred on the file. This event is always monitored, even if it's not specified.
EPOLLET: Enables edge-triggered behavior for the monitor of the file (see the upcoming section "Edge- Versus Level-Triggered Events"). The default behavior is level-triggered.
EPOLLHUP: A hangup occurred on the file. This event is always monitored, even if it's not specified.
EPOLLIN: The file is available to be read from without blocking.
EPOLLONESHOT: After an event is generated and read, the file is automatically no longer monitored. A new event mask must be specified via EPOLL_CTL_MOD to reenable the watch.
EPOLLOUT: The file is available to be written to without blocking.
EPOLLPRI: There is urgent out-of-band data available to read.
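As an illustration of EPOLLONESHOT, the following sketch registers a one-shot watch and later rearms it. The helper names watch_once( ) and rearm_once( ) are hypothetical, chosen only for this example. The point to notice is that a disabled one-shot watch is still registered, so it is rearmed with EPOLL_CTL_MOD rather than EPOLL_CTL_ADD.

```c
#include <sys/epoll.h>

/* Hypothetical helper: watch fd for a single read notification. */
int watch_once (int epfd, int fd)
{
        struct epoll_event ev;

        ev.events = EPOLLIN | EPOLLONESHOT;
        ev.data.fd = fd;
        return epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Hypothetical helper: rearm the watch after its event was handled.
 * The fd is still associated with epfd (the watch is only disabled),
 * so a second EPOLL_CTL_ADD would fail with EEXIST.
 */
int rearm_once (int epfd, int fd)
{
        struct epoll_event ev;

        ev.events = EPOLLIN | EPOLLONESHOT;
        ev.data.fd = fd;
        return epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &ev);
}
```

One-shot watches are often used when several threads wait on the same epoll instance, so that only one thread handles a given file descriptor at a time.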

The data field inside the epoll_event structure is for the user's private use. The contents are returned to the user upon receipt of the requested event. The common practice is to set event.data.fd to fd, which makes it easy to look up which file descriptor caused the event.
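Because data is a union, the ptr member can carry a pointer to per-descriptor state instead of the descriptor itself. The sketch below assumes a hypothetical struct conn and hypothetical helpers watch_conn( ) and conn_from_event( ); none of these names come from the epoll interface.

```c
#include <sys/epoll.h>

/* Hypothetical per-descriptor state; not part of the epoll API. */
struct conn {
        int fd;
        unsigned long events_seen;
};

/* Register c->fd and ask the kernel to hand c back with each event. */
int watch_conn (int epfd, struct conn *c)
{
        struct epoll_event ev;

        ev.events = EPOLLIN;
        ev.data.ptr = c;   /* data is a union: set ptr or fd, not both */
        return epoll_ctl (epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Recover the state from an event filled in by epoll_wait( ). */
struct conn *conn_from_event (const struct epoll_event *ev)
{
        return ev->data.ptr;
}
```

With this approach the file descriptor is stored inside the structure, since setting data.ptr overwrites data.fd.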

Upon success, epoll_ctl( ) returns 0. On failure, the call returns −1, and sets errno to one of the following values:

EBADF: epfd or fd is not a valid file descriptor.
EEXIST: op was EPOLL_CTL_ADD, but fd is already associated with epfd.
EINVAL: epfd is not an epoll instance, epfd is the same as fd, or op is invalid.
ENOENT: op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, but fd is not associated with epfd.
ENOMEM: There was insufficient memory to process the request.
EPERM: fd does not support epoll.
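These error codes can be put to work. The following sketch, built around a hypothetical helper named add_or_modify( ), uses EEXIST to fall back from EPOLL_CTL_ADD to EPOLL_CTL_MOD when the file descriptor is already being watched. It is one possible pattern, not a required one.

```c
#include <errno.h>
#include <sys/epoll.h>

/*
 * Hypothetical helper: watch fd for the given events, whether or not
 * it is already associated with epfd. Returns 0 on success, or -1
 * with errno set by epoll_ctl( ).
 */
int add_or_modify (int epfd, int fd, unsigned int events)
{
        struct epoll_event ev;

        ev.events = events;
        ev.data.fd = fd;

        if (epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
                return 0;
        if (errno != EEXIST)
                return -1;      /* a genuine failure */

        /* fd was already registered: update its event mask instead. */
        return epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &ev);
}
```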

As an example, to add a new watch on the file associated with fd to the epoll instance epfd, you would write:

struct epoll_event event;
int ret;

event.data.fd = fd; /* return the fd to us later */
event.events = EPOLLIN | EPOLLOUT;

ret = epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &event);
if (ret)
        perror ("epoll_ctl");

To modify an existing event on the file associated with fd on the epoll instance epfd, you would write:

struct epoll_event event;
int ret;

event.data.fd = fd; /* return the fd to us later */
event.events = EPOLLIN;

ret = epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
if (ret)
        perror ("epoll_ctl");

Conversely, to remove an existing event on the file associated with fd from the epoll instance epfd, you would write:

struct epoll_event event;
int ret;

ret = epoll_ctl (epfd, EPOLL_CTL_DEL, fd, &event);
if (ret)
        perror ("epoll_ctl");

Note that the event parameter can be NULL when op is EPOLL_CTL_DEL, as there is no event mask to provide. Kernel versions before 2.6.9, however, erroneously check for this parameter to be non-NULL. For portability to these older kernels, you should pass in a valid non-NULL pointer; it will not be touched. Kernel 2.6.9 fixed this bug.

Waiting for Events with Epoll

The system call epoll_wait( ) waits for events on the file descriptors associated with the given epoll instance:

#include <sys/epoll.h>

int epoll_wait (int epfd,
                struct epoll_event *events,
                int maxevents,
                int timeout);

A call to epoll_wait( ) waits up to timeout milliseconds for events on the files associated with the epoll instance epfd. Upon success, events points to memory containing epoll_event structures describing each event, up to a maximum of maxevents events. The return value is the number of events, or −1 on error, in which case errno is set to one of the following:

EBADF: epfd is not a valid file descriptor.
EFAULT: The process does not have write access to the memory pointed at by events.
EINTR: The system call was interrupted by a signal before it could complete.
EINVAL: epfd is not a valid epoll instance, or maxevents is equal to or less than 0.

If timeout is 0, the call returns immediately, even if no events are available, in which case the call will return 0. If the timeout is −1, the call will not return until an event is available.
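Since EINTR is an expected outcome whenever the process handles signals, callers commonly retry the wait. A minimal sketch follows, using a hypothetical wrapper named wait_retry( ). Note one simplification: after an interruption it restarts with the full timeout rather than the time remaining, which is harmless for timeouts of 0 or −1 but makes a finite timeout only approximate.

```c
#include <errno.h>
#include <sys/epoll.h>

/*
 * Hypothetical wrapper: behaves like epoll_wait( ), but retries when
 * the call is interrupted by a signal. The timeout is not adjusted
 * for time already spent waiting.
 */
int wait_retry (int epfd, struct epoll_event *events,
                int maxevents, int timeout)
{
        int n;

        do {
                n = epoll_wait (epfd, events, maxevents, timeout);
        } while (n < 0 && errno == EINTR);

        return n;
}
```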

When the call returns, the events field of the epoll_event structure describes the events that occurred. The data field contains whatever the user set it to before invocation of epoll_ctl( ).

A full epoll_wait( ) example looks like this:

#define MAX_EVENTS    64

struct epoll_event *events;
int nr_events, i, epfd;

events = malloc (sizeof (struct epoll_event) * MAX_EVENTS);
if (!events) {
        perror ("malloc");
        return 1;
}

nr_events = epoll_wait (epfd, events, MAX_EVENTS, -1);
if (nr_events < 0) {
        perror ("epoll_wait");
        free (events);
        return 1;
}

for (i = 0; i < nr_events; i++) {
        printf ("event=%u on fd=%d\n",
                (unsigned int) events[i].events,
                events[i].data.fd);

        /*
         * We now can, per events[i].events, operate on
         * events[i].data.fd without blocking.
         */
}

free (events);

We will cover the functions malloc( ) and free( ) in Chapter 8.

Edge- Versus Level-Triggered Events

If the EPOLLET value is set in the events field of the event parameter passed to epoll_ctl( ), the watch on fd is edge-triggered, as opposed to level-triggered.

Consider the following events between a producer and a consumer communicating over a Unix pipe:

1. The producer writes 1 KB of data onto a pipe.
2. The consumer performs an epoll_wait( ) on the pipe, waiting for the pipe to contain data, and thus be readable.

With a level-triggered watch, the call to epoll_wait( ) in step 2 will return immediately, showing that the pipe is ready to read. With an edge-triggered watch, an event is reported only when the state of the pipe changes, that is, when the write in step 1 occurs. Once that event has been delivered, a subsequent call to epoll_wait( ) will not return merely because unread data remains in the pipe; it blocks until more data is written onto the pipe.

Level-triggered is the default behavior. It is how poll( ) and select( ) behave, and it is what most developers expect. Edge-triggered behavior requires a different approach to programming, commonly utilizing nonblocking I/O, and careful checking for EAGAIN.
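The usual edge-triggered pattern is to place the file descriptor in nonblocking mode and, on each event, read until read( ) fails with EAGAIN. A minimal sketch follows; set_nonblocking( ) and drain( ) are hypothetical helper names, and the data is simply discarded where a real program would process it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: put fd into nonblocking mode. */
int set_nonblocking (int fd)
{
        int flags = fcntl (fd, F_GETFL, 0);

        if (flags < 0)
                return -1;
        return fcntl (fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: after an edge-triggered EPOLLIN event, read
 * until nothing is left. Returns the number of bytes drained, or -1
 * on error. fd must be nonblocking, or the final read( ) would block.
 */
long drain (int fd)
{
        char buf[4096];
        long total = 0;
        ssize_t n;

        for (;;) {
                n = read (fd, buf, sizeof (buf));
                if (n > 0)
                        total += n;     /* real code would use buf here */
                else if (n == 0)
                        break;          /* end of file */
                else if (errno == EINTR)
                        continue;       /* interrupted; try again */
                else if (errno == EAGAIN || errno == EWOULDBLOCK)
                        break;          /* drained; wait for the next edge */
                else
                        return -1;
        }

        return total;
}
```

Stopping before EAGAIN is the classic edge-triggered bug: the unread data generates no further event, so the next epoll_wait( ) can block indefinitely.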

Tip

The terminology comes from electrical engineering. A level-triggered interrupt is issued whenever a line is asserted. An edge-triggered interrupt is caused only during the rising or falling edge of the change in assertion. Level-triggered interrupts are useful when the state of the event (the asserted line) is of interest. Edge-triggered interrupts are useful when the event itself (the line being asserted) is of interest.

^[15] Epoll was introduced in the 2.5.44 development kernel, and the interface was finalized as of 2.5.66.