Recognizing the limitations of both poll( ) and select( ), the 2.6
Linux kernel[15] introduced the event poll (epoll)
facility. While more complex than the two earlier interfaces, epoll
solves the fundamental performance problem shared by both of them, and
adds several new features.
Both poll( ) and select( ) (discussed in Chapter 2) require the full list of file descriptors
to watch on each invocation. The kernel must then walk the entire list of
file descriptors to be monitored. When this list grows large (it may
contain hundreds or even thousands of file descriptors), walking the list
on each invocation becomes a scalability bottleneck.
Epoll circumvents this problem by decoupling the monitor registration from the actual monitoring. One system call initializes an epoll context, another adds monitored file descriptors to or removes them from the context, and a third performs the actual event wait.
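The three-call split can be sketched in a few lines of C. This is a minimal illustration rather than a definitive pattern: the function name epoll_three_steps( ) is hypothetical, and a pipe stands in for whatever descriptor a real program would watch.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: watch the read end of a fresh pipe and report
 * how many events epoll_wait() delivers. Returns -1 on any failure.
 */
int
epoll_three_steps (void)
{
        struct epoll_event event = { .events = EPOLLIN };
        struct epoll_event ready;
        int pipefd[2], epfd, nr_events = -1;

        if (pipe (pipefd) < 0)
                return -1;

        /* Step 1: create the epoll context. */
        epfd = epoll_create (1);
        if (epfd < 0) {
                perror ("epoll_create");
                goto out_pipe;
        }

        /* Step 2: register the descriptor once, up front. */
        event.data.fd = pipefd[0];
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) < 0) {
                perror ("epoll_ctl");
                goto out;
        }

        /* Make the pipe readable so the wait has something to report. */
        if (write (pipefd[1], "x", 1) != 1)
                goto out;

        /* Step 3: wait; no descriptor list is passed in again. */
        nr_events = epoll_wait (epfd, &ready, 1, 1000);

out:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return nr_events;
}
```

Only step 3 runs in the hot path of a real event loop; the registration in step 2 persists across waits, which is what removes the per-call list walk.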
An epoll context is created via epoll_create( ):
#include <sys/epoll.h>
int epoll_create (int size);
A successful call to epoll_create( ) instantiates a new epoll instance, and returns a file
descriptor associated with the instance. This file descriptor has no
relationship to a real file; it is just a handle to be used with
subsequent calls using the epoll facility. The size parameter is a hint to the kernel about
the number of file descriptors that are going to be monitored; it is
not the maximum number. On early 2.6 kernels, passing in a good approximation could result in
better performance, but the exact number is not required; since Linux 2.6.8, the kernel
ignores the hint altogether, although the value must still be greater than zero. On error,
the call returns −1, and sets errno to one of the following:
EINVAL
    The size parameter is not a positive number.
ENFILE
    The system has reached the limit on the total number of open files.
ENOMEM
    Insufficient memory was available to complete the operation.
A typical call is:
int epfd;
epfd = epoll_create (100); /* plan to watch ~100 fds */
if (epfd < 0)
        perror ("epoll_create");
The file descriptor returned from epoll_create( ) should be destroyed via a
call to close( ) after polling is finished.
The epoll_ctl( ) system call
can be used to add file descriptors to and remove file descriptors
from a given epoll context:
#include <sys/epoll.h>
int epoll_ctl (int epfd,
               int op,
               int fd,
               struct epoll_event *event);
The header <sys/epoll.h> defines the epoll_event structure as:
struct epoll_event {
        __u32 events; /* events */
        union {
                void *ptr;
                int fd;
                __u32 u32;
                __u64 u64;
        } data;
};
A successful call to epoll_ctl( ) controls the epoll instance associated with the file
descriptor epfd. The parameter op specifies the operation to be
taken against the file associated with fd. The event parameter further describes the
behavior of the operation.
Here are valid values for the op parameter:
EPOLL_CTL_ADD
    Add a monitor on the file associated with the file descriptor fd to the epoll instance associated with epfd, per the events defined in event.
EPOLL_CTL_DEL
    Remove a monitor on the file associated with the file descriptor fd from the epoll instance associated with epfd.
EPOLL_CTL_MOD
    Modify an existing monitor of fd with the updated events specified by event.
The events field in the
epoll_event structure lists which
events to monitor on the given file descriptor. Multiple events can be
bitwise-ORed together. Here are valid values:
EPOLLERR
    An error condition occurred on the file. This event is always monitored, even if it's not specified.
EPOLLET
    Enables edge-triggered behavior for the monitor of the file (see the upcoming section "Edge- Versus Level-Triggered Events"). The default behavior is level-triggered.
EPOLLHUP
    A hangup occurred on the file. This event is always monitored, even if it's not specified.
EPOLLIN
    The file is available to be read from without blocking.
EPOLLONESHOT
    After an event is generated and read, the file is automatically no longer monitored. A new event mask must be specified via EPOLL_CTL_MOD to reenable the watch.
EPOLLOUT
    The file is available to be written to without blocking.
EPOLLPRI
    There is urgent out-of-band data available to read.
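The EPOLLONESHOT behavior is easiest to see in a small experiment. The sketch below is illustrative only: rearm_oneshot( ) and oneshot_demo( ) are hypothetical names, and a pipe is assumed as the monitored file. It records the result of three zero-timeout waits: one after data arrives, one while the watch is disabled, and one after the watch is rearmed with EPOLL_CTL_MOD.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Re-enable a one-shot watch on fd by supplying a fresh event mask. */
int
rearm_oneshot (int epfd, int fd)
{
        struct epoll_event event = { .events = EPOLLIN | EPOLLONESHOT };

        event.data.fd = fd;
        return epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
}

/*
 * Hypothetical demo: stores the return values of three non-blocking
 * waits in out[]. Returns 0 on success, -1 on a setup failure.
 */
int
oneshot_demo (int out[3])
{
        struct epoll_event event = { .events = EPOLLIN | EPOLLONESHOT };
        struct epoll_event ready;
        int pipefd[2], epfd, ret = -1;

        if (pipe (pipefd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        event.data.fd = pipefd[0];
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) < 0)
                goto out;
        if (write (pipefd[1], "x", 1) != 1)
                goto out;

        out[0] = epoll_wait (epfd, &ready, 1, 0); /* event is delivered */
        out[1] = epoll_wait (epfd, &ready, 1, 0); /* watch now disabled */
        if (rearm_oneshot (epfd, pipefd[0]) < 0)
                goto out;
        out[2] = epoll_wait (epfd, &ready, 1, 0); /* byte is still unread */
        ret = 0;
out:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return ret;
}
```

One-shot watches are typically chosen when several threads wait on the same epoll instance, so that only one of them handles a given descriptor until it is explicitly rearmed.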
The data field inside the
epoll_event structure is for the
user's private use. The contents are returned to the user upon receipt
of the requested event. The common practice is to set event.data.fd to fd, which makes it easy to look up which
file descriptor caused the event.
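Because data is a union, a program that needs more than the descriptor number can store a pointer to its own bookkeeping in data.ptr instead of using data.fd. The following is a minimal sketch under that assumption; struct watch and the watch_*( ) functions are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor state carried through the data field. */
struct watch {
        int fd;
        const char *label;
};

/* Register w->fd for reading and stash w itself in data.ptr. */
int
watch_add (int epfd, struct watch *w)
{
        struct epoll_event event = { .events = EPOLLIN };

        event.data.ptr = w; /* data is a union: use ptr or fd, not both */
        return epoll_ctl (epfd, EPOLL_CTL_ADD, w->fd, &event);
}

/* Wait up to timeout ms for one event; return its struct watch or NULL. */
struct watch *
watch_wait (int epfd, int timeout)
{
        struct epoll_event ready;

        if (epoll_wait (epfd, &ready, 1, timeout) != 1)
                return NULL;
        return ready.data.ptr;
}

/* Demo: returns 1 if the registered pointer comes back unchanged. */
int
watch_demo (void)
{
        struct watch w = { .label = "pipe read end" };
        int pipefd[2], epfd, same = 0;

        if (pipe (pipefd) < 0)
                return -1;
        w.fd = pipefd[0];

        epfd = epoll_create (1);
        if (epfd >= 0 && watch_add (epfd, &w) == 0 &&
            write (pipefd[1], "x", 1) == 1)
                same = (watch_wait (epfd, 1000) == &w);

        if (epfd >= 0)
                close (epfd);
        close (pipefd[0]);
        close (pipefd[1]);
        return same;
}
```

With this layout the descriptor is still reachable as w->fd, so nothing is lost by giving up data.fd; the object must simply outlive its registration.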
Upon success, epoll_ctl( )
returns 0. On failure, the call
returns −1, and sets errno to one of the following values:
EBADF
    epfd or fd is not a valid file descriptor.
EEXIST
    op was EPOLL_CTL_ADD, but fd is already associated with epfd.
EINVAL
    epfd is not an epoll instance, epfd is the same as fd, or op is invalid.
ENOENT
    op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, but fd is not associated with epfd.
ENOMEM
    There was insufficient memory to process the request.
EPERM
    fd does not support epoll.
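Three of these error conditions are easy to provoke deliberately, which can help when testing error-handling paths. The sketch below assumes a pipe as the monitored file; ctl_errno( ) and ctl_errors_demo( ) are hypothetical helper names.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Run one epoll_ctl() call; return 0 on success or the errno it set. */
static int
ctl_errno (int epfd, int op, int fd)
{
        struct epoll_event event = { .events = EPOLLIN };

        event.data.fd = fd;
        return epoll_ctl (epfd, op, fd, &event) == 0 ? 0 : errno;
}

/*
 * Hypothetical demo: stores the errno from three failing calls in
 * out[]. Returns 0 on success, -1 on a setup failure.
 */
int
ctl_errors_demo (int out[3])
{
        int pipefd[2], epfd, ret = -1;

        if (pipe (pipefd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        if (ctl_errno (epfd, EPOLL_CTL_ADD, pipefd[0]) != 0)
                goto out;
        /* Adding the same descriptor a second time fails. */
        out[0] = ctl_errno (epfd, EPOLL_CTL_ADD, pipefd[0]);

        if (ctl_errno (epfd, EPOLL_CTL_DEL, pipefd[0]) != 0)
                goto out;
        /* Modifying a descriptor that is no longer registered fails. */
        out[1] = ctl_errno (epfd, EPOLL_CTL_MOD, pipefd[0]);

        /* An epoll instance cannot monitor itself. */
        out[2] = ctl_errno (epfd, EPOLL_CTL_ADD, epfd);
        ret = 0;
out:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return ret;
}
```

After a successful run, out[ ] holds EEXIST, ENOENT, and EINVAL, in that order.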
As an example, to add a new watch on the file associated with
fd to the epoll instance epfd, you would write:
struct epoll_event event;
int ret;
event.data.fd = fd; /* return the fd to us later */
event.events = EPOLLIN | EPOLLOUT;
ret = epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &event);
if (ret)
        perror ("epoll_ctl");
To modify an existing event on the file associated with fd on the epoll instance epfd, you would write:
struct epoll_event event;
int ret;
event.data.fd = fd; /* return the fd to us later */
event.events = EPOLLIN;
ret = epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
if (ret)
        perror ("epoll_ctl");
Conversely, to remove an existing event on the file associated
with fd from the epoll instance epfd, you would write:
struct epoll_event event;
int ret;
ret = epoll_ctl (epfd, EPOLL_CTL_DEL, fd, &event);
if (ret)
        perror ("epoll_ctl");
Note that the event parameter
can be NULL when op is EPOLL_CTL_DEL, as there is no event mask to
provide. Kernel versions before 2.6.9, however, erroneously check for
this parameter to be non-NULL. For
portability to these older kernels, you should pass in a valid
non-NULL pointer; it will not be
touched. Kernel 2.6.9 fixed this bug.
The system call epoll_wait( )
waits for events on the file descriptors associated with the given
epoll instance:
#include <sys/epoll.h>
int epoll_wait (int epfd,
                struct epoll_event *events,
                int maxevents,
                int timeout);
A call to epoll_wait( ) waits
up to timeout milliseconds for
events on the files associated with the epoll instance epfd. Upon success, events points to memory containing epoll_event structures describing each
event, up to a maximum of maxevents
events. The return value is the number of events, or −1 on error, in which case errno is set to one of the following:
EBADF
    epfd is not a valid file descriptor.
EFAULT
    The process does not have write access to the memory pointed at by events.
EINTR
    The system call was interrupted by a signal before it could complete.
EINVAL
    epfd is not a valid epoll instance, or maxevents is equal to or less than 0.
If timeout is 0, the call returns immediately, even if no
events are available, in which case the call will return 0. If the timeout is −1, the call will not return until an event
is available.
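The two special timeout values, together with the EINTR case listed above, can be exercised with a short sketch. The wrapper epoll_wait_retry( ) and the function timeout_demo( ) are hypothetical names; note that the wrapper simply restarts the full timeout after each interruption, which is a simplification.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: retry epoll_wait() when a signal interrupts
 * it. Each retry restarts the whole timeout, a simplification.
 */
int
epoll_wait_retry (int epfd, struct epoll_event *events,
                  int maxevents, int timeout)
{
        int nr_events;

        do
                nr_events = epoll_wait (epfd, events, maxevents, timeout);
        while (nr_events < 0 && errno == EINTR);

        return nr_events;
}

/*
 * Hypothetical demo: out[0] is a zero-timeout wait on an empty pipe,
 * out[1] an indefinite wait once the pipe holds data.
 * Returns 0 on success, -1 on a setup failure.
 */
int
timeout_demo (int out[2])
{
        struct epoll_event event = { .events = EPOLLIN };
        struct epoll_event ready;
        int pipefd[2], epfd, ret = -1;

        if (pipe (pipefd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        event.data.fd = pipefd[0];
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) < 0)
                goto out;

        /* Timeout 0: return at once, reporting that nothing is ready. */
        out[0] = epoll_wait_retry (epfd, &ready, 1, 0);

        if (write (pipefd[1], "x", 1) != 1)
                goto out;

        /* Timeout -1: block until an event exists; one already does. */
        out[1] = epoll_wait_retry (epfd, &ready, 1, -1);
        ret = 0;
out:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return ret;
}
```

A program that must honor an overall deadline would instead recompute the remaining time before each retry.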
When the call returns, the events field of the epoll_event structure describes the events
that occurred. The data field
contains whatever the user set it to before invocation of epoll_ctl( ).
A full epoll_wait( ) example
looks like this:
#define MAX_EVENTS 64
struct epoll_event *events;
int nr_events, i, epfd; /* epfd comes from an earlier epoll_create( ) */
events = malloc (sizeof (struct epoll_event) * MAX_EVENTS);
if (!events) {
        perror ("malloc");
        return 1;
}
nr_events = epoll_wait (epfd, events, MAX_EVENTS, -1);
if (nr_events < 0) {
        perror ("epoll_wait");
        free (events);
        return 1;
}
for (i = 0; i < nr_events; i++) {
        /* events is an unsigned 32-bit mask, hence %u */
        printf ("event=%u on fd=%d\n",
                events[i].events,
                events[i].data.fd);
        /*
         * We now can, per events[i].events, operate on
         * events[i].data.fd without blocking.
         */
}
free (events);
We will cover the functions malloc( ) and free( ) in Chapter 8.
If the EPOLLET value is set
in the events field of the event parameter passed to epoll_ctl( ), the watch on fd is edge-triggered,
as opposed to level-triggered.
Consider the following events between a producer and a consumer communicating over a Unix pipe:
1. The producer writes 1 KB of data onto a pipe.
2. The consumer performs an epoll_wait( ) on the pipe, waiting for the pipe to contain data, and thus be readable.
With a level-triggered watch, the call to epoll_wait( ) in step 2 will return
immediately, showing that the pipe is ready to read. With an
edge-triggered watch, this call will not return until after step 1
occurs. That is, even if the pipe is readable at the invocation of
epoll_wait( ), the call will not
return until the data is written onto the pipe.
Level-triggered is the default behavior. It is how poll( ) and select( )
behave, and it is what most developers expect.
Edge-triggered behavior requires a different approach to programming,
commonly utilizing nonblocking I/O, and careful checking for EAGAIN.
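A common way to meet that requirement is to make the descriptor nonblocking and, on each event, read until read( ) fails with EAGAIN. The sketch below illustrates the idea on a pipe; set_nonblocking( ), drain( ), and edge_demo( ) are hypothetical names, and the demo deliberately reads only half the data once to show that an edge-triggered watch does not report the remainder.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Put fd into nonblocking mode, preserving its other status flags. */
int
set_nonblocking (int fd)
{
        int flags = fcntl (fd, F_GETFL);

        if (flags < 0)
                return -1;
        return fcntl (fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read until EAGAIN or EOF; return bytes consumed, or -1 on error. */
int
drain (int fd)
{
        char buf[256];
        int total = 0;
        ssize_t n;

        for (;;) {
                n = read (fd, buf, sizeof (buf));
                if (n > 0)
                        total += (int) n;
                else if (n == 0)
                        return total; /* end of file */
                else if (errno == EINTR)
                        continue;
                else if (errno == EAGAIN || errno == EWOULDBLOCK)
                        return total; /* nothing left for now */
                else
                        return -1;
        }
}

/* Hypothetical demo; returns 0 on success, -1 on a setup failure. */
int
edge_demo (int out[3])
{
        struct epoll_event event = { .events = EPOLLIN | EPOLLET };
        struct epoll_event ready;
        char buf[1024];
        int pipefd[2], epfd, ret = -1;

        if (pipe (pipefd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        event.data.fd = pipefd[0];
        if (set_nonblocking (pipefd[0]) < 0 ||
            epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) < 0)
                goto out;

        memset (buf, 'x', sizeof (buf));
        if (write (pipefd[1], buf, sizeof (buf)) != (ssize_t) sizeof (buf))
                goto out;

        out[0] = epoll_wait (epfd, &ready, 1, 0); /* the write is an edge */
        if (read (pipefd[0], buf, 512) != 512)
                goto out;
        out[1] = epoll_wait (epfd, &ready, 1, 0); /* data left, no new edge */
        out[2] = drain (pipefd[0]);               /* consume the remainder */
        ret = 0;
out:
        close (epfd);
out_pipe:
        close (pipefd[0]);
        close (pipefd[1]);
        return ret;
}
```

The second wait reports nothing even though 512 bytes remain, which is why an edge-triggered consumer that stops reading early can stall; draining to EAGAIN on every event avoids that.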
Tip
The terminology comes from electrical engineering. A level-triggered interrupt is issued whenever a line is asserted. An edge-triggered interrupt is caused only during the rising or falling edge of the change in assertion. Level-triggered interrupts are useful when the state of the event (the asserted line) is of interest. Edge-triggered interrupts are useful when the event itself (the line being asserted) is of interest.
[15] Epoll was introduced in the 2.5.44 development kernel, and the interface was finalized as of 2.5.66.