Recognizing the limitations of both poll( ) and select( ), the 2.6 Linux kernel[15] introduced the event poll (epoll) facility. While more complex than the two earlier interfaces, epoll solves the fundamental performance problem shared by both of them, and adds several new features.
Both poll( ) and select( ) (discussed in Chapter 2) require the full list of file descriptors to watch on each invocation. The kernel must then walk the entire list of file descriptors to be monitored. When this list grows large (it may contain hundreds or even thousands of file descriptors), walking the list on each invocation becomes a scalability bottleneck.
Epoll circumvents this problem by decoupling the monitor registration from the actual monitoring. One system call initializes an epoll context, another adds monitored file descriptors to or removes them from the context, and a third performs the actual event wait.
An epoll context is created via epoll_create( ):

    #include <sys/epoll.h>

    int epoll_create (int size);
A successful call to epoll_create( ) instantiates a new epoll instance, and returns a file descriptor associated with the instance. This file descriptor has no relationship to a real file; it is just a handle to be used with subsequent calls using the epoll facility. The size parameter is a hint to the kernel about the number of file descriptors that are going to be monitored; it is not the maximum number. Passing in a good approximation will result in better performance, but the exact number is not required. (Since Linux 2.6.8, the kernel ignores the hint and sizes its data structures dynamically, but size must still be greater than zero.) On error, the call returns -1, and sets errno to one of the following:
EINVAL
The size parameter is not a positive number.
ENFILE
The system has reached the limit on the total number of open files.
ENOMEM
Insufficient memory was available to complete the operation.
A typical call is:
    int epfd;

    epfd = epoll_create (100); /* plan to watch ~100 fds */
    if (epfd < 0)
            perror ("epoll_create");
The file descriptor returned from epoll_create( ) should be destroyed via a call to close( ) after polling is finished.
The epoll_ctl( ) system call can be used to add file descriptors to and remove file descriptors from a given epoll context:

    #include <sys/epoll.h>

    int epoll_ctl (int epfd,
                   int op,
                   int fd,
                   struct epoll_event *event);
The header <sys/epoll.h> defines the epoll_event structure as:

    struct epoll_event {
            __u32 events;  /* events */
            union {
                    void *ptr;
                    int fd;
                    __u32 u32;
                    __u64 u64;
            } data;
    };
A successful call to epoll_ctl( ) controls the epoll instance associated with the file descriptor epfd. The parameter op specifies the operation to be taken against the file associated with fd. The event parameter further describes the behavior of the operation.
Here are valid values for the op parameter:
EPOLL_CTL_ADD
Add a monitor on the file associated with the file descriptor fd to the epoll instance associated with epfd, per the events defined in event.
EPOLL_CTL_DEL
Remove a monitor on the file associated with the file descriptor fd from the epoll instance associated with epfd.
EPOLL_CTL_MOD
Modify an existing monitor of fd with the updated events specified by event.
The events field in the epoll_event structure lists which events to monitor on the given file descriptor. Multiple events can be bitwise-ORed together. Here are valid values:
EPOLLERR
An error condition occurred on the file. This event is always monitored, even if it's not specified.
EPOLLET
Enables edge-triggered behavior for the monitor of the file (see the upcoming section "Edge- Versus Level-Triggered Events"). The default behavior is level-triggered.
EPOLLHUP
A hangup occurred on the file. This event is always monitored, even if it's not specified.
EPOLLIN
The file is available to be read from without blocking.
EPOLLONESHOT
After an event is generated and read, the file is automatically no longer monitored. A new event mask must be specified via EPOLL_CTL_MOD to reenable the watch.
EPOLLOUT
The file is available to be written to without blocking.
EPOLLPRI
There is urgent out-of-band data available to read.
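The one-shot behavior is easier to see in a small experiment. The following is a minimal sketch, not a definitive implementation: rearm_oneshot( ) and oneshot_demo( ) are hypothetical helper names, and a pipe stands in for whatever descriptor a real program would watch. The demo records the return value of three nonblocking epoll_wait( ) calls: one after data arrives, one after the one-shot watch has fired, and one after the watch is rearmed via EPOLL_CTL_MOD.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Re-enable a one-shot watch on fd by supplying a fresh event mask. */
int rearm_oneshot (int epfd, int fd)
{
        struct epoll_event event = { 0 };

        event.data.fd = fd;
        event.events = EPOLLIN | EPOLLONESHOT;  /* flags are bitwise-ORed */

        return epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
}

/*
 * Hypothetical demo: fills results[] with the return values of three
 * nonblocking epoll_wait() calls. Returns 0 on success, -1 on failure.
 */
int oneshot_demo (int results[3])
{
        struct epoll_event event = { 0 }, out;
        int pfd[2], epfd, ret = -1;

        if (pipe (pfd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        event.data.fd = pfd[0];
        event.events = EPOLLIN | EPOLLONESHOT;
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pfd[0], &event) < 0)
                goto out_epfd;
        if (write (pfd[1], "x", 1) != 1)
                goto out_epfd;

        results[0] = epoll_wait (epfd, &out, 1, 0); /* event is delivered */
        results[1] = epoll_wait (epfd, &out, 1, 0); /* watch is now disabled */

        if (rearm_oneshot (epfd, pfd[0]) < 0)
                goto out_epfd;
        results[2] = epoll_wait (epfd, &out, 1, 0); /* unread data reported again */
        ret = 0;

out_epfd:
        close (epfd);
out_pipe:
        close (pfd[0]);
        close (pfd[1]);
        return ret;
}
```

Because the byte is never read from the pipe, the second call shows that the watch itself was disabled, not that the data went away.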
The data field inside the epoll_event structure is for the user's private use. The contents are returned to the user upon receipt of the requested event. Because data is a union, only one of its members can be used at a time. The common practice is to set event.data.fd to fd, which makes it easy to look up which file descriptor caused the event.
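When a program needs more context than a bare descriptor, the ptr member of the union can carry a pointer to per-descriptor state instead. The sketch below illustrates the idea under stated assumptions: struct watch, add_watch( ), and watch_demo( ) are hypothetical names invented for this example, and a pipe is used only so that the sketch is self-contained.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor state; not part of the epoll API. */
struct watch {
        int fd;
        const char *name;
};

/* Register w->fd for reads, stashing a pointer to w in the user data. */
int add_watch (int epfd, struct watch *w)
{
        struct epoll_event event = { 0 };

        event.events = EPOLLIN;
        event.data.ptr = w;     /* data is a union: use ptr or fd, not both */

        return epoll_ctl (epfd, EPOLL_CTL_ADD, w->fd, &event);
}

/* Returns 1 if epoll_wait() hands back the pointer we registered. */
int watch_demo (void)
{
        struct watch w = { -1, "pipe-reader" };
        struct epoll_event out;
        int pfd[2], epfd, ok = 0;

        if (pipe (pfd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0) {
                close (pfd[0]);
                close (pfd[1]);
                return -1;
        }

        w.fd = pfd[0];
        if (add_watch (epfd, &w) == 0 &&
            write (pfd[1], "x", 1) == 1 &&
            epoll_wait (epfd, &out, 1, 0) == 1) {
                struct watch *hit = out.data.ptr;

                ok = (hit == &w && hit->fd == pfd[0]);
        }

        close (epfd);
        close (pfd[0]);
        close (pfd[1]);
        return ok;
}
```

The object that ptr refers to must outlive the watch; the kernel stores only the pointer value and never dereferences it.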
Upon success, epoll_ctl( ) returns 0. On failure, the call returns -1, and sets errno to one of the following values:
EBADF
epfd or fd is not a valid file descriptor.
EEXIST
op was EPOLL_CTL_ADD, but fd is already associated with epfd.
EINVAL
epfd is not an epoll instance, epfd is the same as fd, or op is invalid.
ENOENT
op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, but fd is not associated with epfd.
ENOMEM
There was insufficient memory to process the request.
EPERM
fd does not support epoll.
As an example, to add a new watch on the file associated with fd to the epoll instance epfd, you would write:

    struct epoll_event event;
    int ret;

    event.data.fd = fd; /* return the fd to us later */
    event.events = EPOLLIN | EPOLLOUT;

    ret = epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &event);
    if (ret)
            perror ("epoll_ctl");
To modify an existing event on the file associated with fd on the epoll instance epfd, you would write:

    struct epoll_event event;
    int ret;

    event.data.fd = fd; /* return the fd to us later */
    event.events = EPOLLIN;

    ret = epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
    if (ret)
            perror ("epoll_ctl");
Conversely, to remove an existing event on the file associated with fd from the epoll instance epfd, you would write:

    struct epoll_event event;
    int ret;

    ret = epoll_ctl (epfd, EPOLL_CTL_DEL, fd, &event);
    if (ret)
            perror ("epoll_ctl");
Note that the event parameter can be NULL when op is EPOLL_CTL_DEL, as there is no event mask to provide. Kernel versions before 2.6.9, however, erroneously check for this parameter to be non-NULL. For portability to these older kernels, you should pass in a valid non-NULL pointer; it will not be touched. Kernel 2.6.9 fixed this bug.
The system call epoll_wait( ) waits for events on the file descriptors associated with the given epoll instance:

    #include <sys/epoll.h>

    int epoll_wait (int epfd,
                    struct epoll_event *events,
                    int maxevents,
                    int timeout);
A call to epoll_wait( ) waits up to timeout milliseconds for events on the files associated with the epoll instance epfd. Upon success, events points to memory containing epoll_event structures describing each event, up to a maximum of maxevents events. The return value is the number of events, or -1 on error, in which case errno is set to one of the following:
EBADF
epfd is not a valid file descriptor.
EFAULT
The process does not have write access to the memory pointed at by events.
EINTR
The system call was interrupted by a signal before it could complete.
EINVAL
epfd is not a valid epoll instance, or maxevents is equal to or less than 0.
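Of these errors, EINTR is the one a program usually wants to recover from rather than report. A minimal sketch of a retry wrapper follows; epoll_wait_retry( ) is a hypothetical helper name, not part of the epoll interface, and the restart policy is an assumption that suits simple programs.

```c
#include <errno.h>
#include <sys/epoll.h>

/* Like epoll_wait(), but restarts the call if a signal interrupts it. */
int epoll_wait_retry (int epfd,
                      struct epoll_event *events,
                      int maxevents,
                      int timeout)
{
        int ret;

        do {
                ret = epoll_wait (epfd, events, maxevents, timeout);
        } while (ret < 0 && errno == EINTR);

        return ret;
}
```

Note that each restart begins a fresh wait of the full timeout, so a program that needs a strict deadline would have to recompute the remaining time before retrying.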
If timeout is 0, the call returns immediately, even if no events are available, in which case the call will return 0. If the timeout is -1, the call will not return until an event is available.
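The three timeout modes can be compared side by side. The following sketch is illustrative only: timeout_demo( ) is a hypothetical function, the 50-millisecond value is arbitrary, and a pipe is used so the example needs no external input. It records the return value of a nonblocking check, a bounded wait with nothing to read, and an indefinite wait issued once data is already present.

```c
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical demo: watch the read end of a pipe and record what
 * epoll_wait() returns under different timeouts.
 * Returns 0 on success, -1 on failure.
 */
int timeout_demo (int results[3])
{
        struct epoll_event event = { 0 }, out;
        int pfd[2], epfd, ret = -1;

        if (pipe (pfd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0)
                goto out_pipe;

        event.data.fd = pfd[0];
        event.events = EPOLLIN;
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pfd[0], &event) < 0)
                goto out_epfd;

        results[0] = epoll_wait (epfd, &out, 1, 0);  /* check only, never blocks */
        results[1] = epoll_wait (epfd, &out, 1, 50); /* block for at most 50 ms */

        if (write (pfd[1], "x", 1) != 1)
                goto out_epfd;
        results[2] = epoll_wait (epfd, &out, 1, -1); /* data is waiting: returns at once */
        ret = 0;

out_epfd:
        close (epfd);
out_pipe:
        close (pfd[0]);
        close (pfd[1]);
        return ret;
}
```

A timeout of -1 is only safe here because the data was written beforehand; with an empty pipe and no other writer, that call would block forever.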
When the call returns, the events field of the epoll_event structure describes the events that occurred. The data field contains whatever the user set it to before invocation of epoll_ctl( ).
A full epoll_wait( ) example looks like this:

    #define MAX_EVENTS 64

    struct epoll_event *events;
    int nr_events, i, epfd;

    events = malloc (sizeof (struct epoll_event) * MAX_EVENTS);
    if (!events) {
            perror ("malloc");
            return 1;
    }

    nr_events = epoll_wait (epfd, events, MAX_EVENTS, -1);
    if (nr_events < 0) {
            perror ("epoll_wait");
            free (events);
            return 1;
    }

    for (i = 0; i < nr_events; i++) {
            /* events is a 32-bit unsigned mask, so print it with %u */
            printf ("event=%u on fd=%d\n",
                    (unsigned int) events[i].events,
                    events[i].data.fd);

            /*
             * We now can, per events[i].events, operate on
             * events[i].data.fd without blocking.
             */
    }

    free (events);
We will cover the functions malloc( ) and free( ) in Chapter 8.
If the EPOLLET value is set in the events field of the event parameter passed to epoll_ctl( ), the watch on fd is edge-triggered, as opposed to level-triggered.
Consider the following events between a producer and a consumer communicating over a Unix pipe:
1. The producer writes 1 KB of data onto a pipe.
2. The consumer performs an epoll_wait( ) on the pipe, waiting for the pipe to contain data, and thus be readable.
With a level-triggered watch, the call to epoll_wait( ) in step 2 will return immediately, showing that the pipe is ready to read. With an edge-triggered watch, this call will not return until after step 1 occurs. That is, even if the pipe is readable at the invocation of epoll_wait( ), the call will not return until the data is written onto the pipe.
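The practical difference shows up when data is left unread. The sketch below is a small experiment rather than production code: count_reports( ) is a hypothetical helper that writes one byte into a pipe, then calls epoll_wait( ) twice with a zero timeout without ever reading the byte. A level-triggered watch should report the pipe as readable both times, because the state "data is available" persists; an edge-triggered watch should report it only once, because only the write itself counts as an event.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Write one byte into a pipe watched with EPOLLIN | extra_flags, then
 * call epoll_wait() twice without reading the byte. Returns how many
 * of the two calls reported the pipe as readable, or -1 on error.
 */
int count_reports (uint32_t extra_flags)
{
        struct epoll_event event = { 0 }, out;
        int pfd[2], epfd, i, reports = 0;

        if (pipe (pfd) < 0)
                return -1;
        epfd = epoll_create (1);
        if (epfd < 0) {
                close (pfd[0]);
                close (pfd[1]);
                return -1;
        }

        event.data.fd = pfd[0];
        event.events = EPOLLIN | extra_flags;
        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pfd[0], &event) < 0 ||
            write (pfd[1], "x", 1) != 1)
                reports = -1;

        /* The byte stays in the pipe across both calls. */
        for (i = 0; reports >= 0 && i < 2; i++)
                if (epoll_wait (epfd, &out, 1, 0) == 1)
                        reports++;

        close (epfd);
        close (pfd[0]);
        close (pfd[1]);
        return reports;
}
```

Calling count_reports (0) exercises the default level-triggered mode, and count_reports (EPOLLET) exercises the edge-triggered mode.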
Level-triggered is the default behavior. It is how poll( ) and select( ) behave, and it is what most developers expect. Edge-triggered behavior requires a different approach to programming, commonly utilizing nonblocking I/O, and careful checking for EAGAIN.
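One common way to apply that approach is to put the descriptor into nonblocking mode and, each time an edge-triggered event arrives, read until the kernel reports EAGAIN. The following is a sketch under stated assumptions, not the only valid pattern: set_nonblocking( ), drain_fd( ), and drain_demo( ) are hypothetical names, the 4 KB buffer size is arbitrary, and the data is simply discarded where a real program would process it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into nonblocking mode; returns 0 on success, -1 on error. */
int set_nonblocking (int fd)
{
        int flags = fcntl (fd, F_GETFL);

        if (flags < 0)
                return -1;
        return fcntl (fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Read from a nonblocking fd until no data remains. Returns the number
 * of bytes consumed, or -1 on a real error.
 */
long drain_fd (int fd)
{
        char buf[4096];
        long total = 0;
        ssize_t n;

        for (;;) {
                n = read (fd, buf, sizeof (buf));
                if (n > 0)
                        total += n;     /* a real program would process buf */
                else if (n == 0)
                        break;          /* end of file */
                else if (errno == EINTR)
                        continue;       /* interrupted: try again */
                else if (errno == EAGAIN || errno == EWOULDBLOCK)
                        break;          /* fully drained: wait for the next edge */
                else
                        return -1;
        }

        return total;
}

/* Demo: queue 5,000 bytes in a pipe and drain them in one pass. */
long drain_demo (void)
{
        char payload[5000] = { 0 };
        int pfd[2];
        long got = -1;

        if (pipe (pfd) < 0)
                return -1;
        if (set_nonblocking (pfd[0]) == 0 &&
            write (pfd[1], payload, sizeof (payload)) ==
                    (ssize_t) sizeof (payload))
                got = drain_fd (pfd[0]);

        close (pfd[0]);
        close (pfd[1]);
        return got;
}
```

Stopping before EAGAIN would leave data in the buffer with no further edge to announce it, which is why the loop runs to exhaustion rather than performing a single read.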
Tip
The terminology comes from electrical engineering. A level-triggered interrupt is issued whenever a line is asserted. An edge-triggered interrupt is caused only during the rising or falling edge of the change in assertion. Level-triggered interrupts are useful when the state of the event (the asserted line) is of interest. Edge-triggered interrupts are useful when the event itself (the line being asserted) is of interest.
[15] Epoll was introduced in the 2.5.44 development kernel, and the interface was finalized as of 2.5.66.