NAME
sosplice, somove - splice two sockets for zero-copy data transfer
SYNOPSIS
int
sosplice(struct socket *so, int fd, off_t max, struct timeval *tv);
int
somove(struct socket *so, int wait);
DESCRIPTION
The function sosplice() is used to splice together a source and a drain
socket. The source socket is passed as the so argument; the file descriptor
of the drain is passed in fd. If fd is negative, an existing splicing gets
dissolved. If max is positive, at most that many bytes will get transferred.
If tv is not NULL, a timeout(9) is scheduled to dissolve splicing in the case
when no data can be transferred for the specified period of time. Socket
splicing can be invoked from userland via the setsockopt(2) system call at
the SOL_SOCKET level with the socket option SO_SPLICE.
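From userland, the request described above could look roughly like this. It is a minimal sketch, assuming a system whose <sys/socket.h> provides SO_SPLICE and struct splice with the fields sp_fd, sp_max and sp_idle, as OpenBSD does. The helper name splice_sockets is hypothetical, and on systems without SO_SPLICE it fails with ENOPROTOOPT.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: splice src to drain, moving at most max bytes
 * (0 means no limit) with an optional idle timeout.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
splice_sockets(int src, int drain, off_t max, const struct timeval *idle)
{
#ifdef SO_SPLICE
	struct splice sp;

	memset(&sp, 0, sizeof(sp));
	sp.sp_fd = drain;	/* a negative fd dissolves the splicing */
	sp.sp_max = max;
	if (idle != NULL)
		sp.sp_idle = *idle;
	return setsockopt(src, SOL_SOCKET, SO_SPLICE, &sp, sizeof(sp));
#else
	/* Assumption: this platform has no socket splicing support. */
	(void)src;
	(void)drain;
	(void)max;
	(void)idle;
	errno = ENOPROTOOPT;
	return -1;
#endif
}
```

A caller would pass the two connected socket descriptors, for example splice_sockets(src, drain, 0, NULL) for an unlimited splice without idle timeout.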
Before connecting both sockets, several checks are executed. See the ERRORS section for possible failures. The connection between both sockets is implemented by setting these additional fields in the struct sosplice *so_sp field in struct socket:
- struct socket *ssp_socket links from the source to the drain socket.
- struct socket *ssp_soback links back from the drain to the source socket.
- off_t ssp_len counts the number of bytes spliced so far from this socket.
- off_t ssp_max specifies the maximum number of bytes to splice from this socket if non-zero.
- struct timeval ssp_idletv specifies the maximum idle time if non-zero.
- struct timeout ssp_idleto provides storage for the kernel timeout if idle time is used.
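To make the linkage concrete, the fields above can be modelled in userland. This is only a sketch: the types model_socket and model_sosplice and the function model_link are hypothetical stand-ins, not the kernel definitions. The kernel timeout storage is omitted, and so_sp is embedded here rather than being a pointer.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <errno.h>
#include <stddef.h>

struct model_socket;

/* Hypothetical mirror of the fields listed above (ssp_idleto omitted). */
struct model_sosplice {
	struct model_socket *ssp_socket;	/* source -> drain */
	struct model_socket *ssp_soback;	/* drain -> source */
	off_t ssp_len;				/* bytes spliced so far */
	off_t ssp_max;				/* limit, 0 means none */
	struct timeval ssp_idletv;		/* idle time, 0 means none */
};

struct model_socket {
	struct model_sosplice so_sp;
};

/* Link source to drain; returns 0 or an errno value, like sosplice(). */
int
model_link(struct model_socket *src, struct model_socket *drain, off_t max)
{
	if (max < 0)
		return EINVAL;
	if (src->so_sp.ssp_socket != NULL || drain->so_sp.ssp_soback != NULL)
		return EBUSY;
	src->so_sp.ssp_socket = drain;
	drain->so_sp.ssp_soback = src;
	src->so_sp.ssp_len = 0;
	src->so_sp.ssp_max = max;
	return 0;
}
```

The two checks correspond to the EINVAL and EBUSY cases listed in the ERRORS section; the remaining checks are left out of this model.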
After connecting both sockets, sosplice() calls somove() to transfer the
mbufs already in the source receive buffer to the drain send buffer. Finally
the socket buffer flag SB_SPLICE is set on both socket buffers, to indicate
that the protocol layer has to call somove() whenever data or space is
available.
The function somove() transfers data from the source's receive buffer to the
drain's send buffer. It must be called at splsoftnet(9), and the so argument
must be a spliced source socket. It may be necessary to split an mbuf to
handle out-of-band data inline or when the maximum splice length has been
reached. If wait is M_WAIT, splitting mbufs will always succeed. For
M_DONTWAIT the out-of-band property might get lost or a short splice might
happen. In the latter case, fewer than the given maximum number of bytes are
transferred and userland has to cope with this. Note that a short splice
cannot happen if somove() was called by sosplice(). So a second setsockopt(2)
after a short splice pointing to the same maximum will always succeed.
Before transferring data, somove() checks both sockets for errors and that
the drain socket is connected. If the drain cannot send anymore, an EPIPE
error is set on the source socket. The data length to move is limited by the
optional maximum splice length and the space in the drain's send socket
buffer. Up to this amount of data is taken out of the source's receive socket
buffer. To avoid splicing loops created by userland, the number of times an
mbuf may be moved between sockets is limited to 128.
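The length limit and the loop limit can be illustrated with a small userland model. This is a sketch under assumptions: model_move_len, model_may_move and MODEL_MAX_MOVES are hypothetical names, and the arithmetic only restates the limits described above rather than reproducing the kernel code.

```c
#include <sys/types.h>

#define MODEL_MAX_MOVES	128	/* loop limit stated above */

/*
 * Hypothetical helper: number of bytes one pass may move.
 * max == 0 means no maximum; done is the count spliced so far;
 * avail is the data in the source's receive buffer; space is the
 * room left in the drain's send buffer.
 */
off_t
model_move_len(off_t max, off_t done, off_t avail, off_t space)
{
	off_t len = avail;

	if (max != 0 && max - done < len)
		len = max - done;
	if (space < len)
		len = space;
	return len < 0 ? 0 : len;
}

/* An mbuf that was already moved 128 times is not moved again. */
int
model_may_move(unsigned int moves)
{
	return moves < MODEL_MAX_MOVES;
}
```

For example, with a maximum of 100 bytes of which 40 are already spliced, at most 60 bytes are moved even if more data and space are available.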
For atomic protocols, either one complete packet is taken out, or nothing is
taken at all if: the packet is bigger than the drain's send buffer size, in
which case the splicing gets aborted with an EMSGSIZE error; the packet does
not fit into the drain's current send buffer space, in which case it is left
in the source's receive buffer for later processing; or the maximum splice
length is located within a packet, in which case splicing gets dissolved like
a short splice. All address or control mbufs associated with the taken packet
are dropped.
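The three cases for atomic protocols can be written as a decision function. This is a userland sketch: the enum, the function name model_atomic_decide and the order in which the conditions are tested are assumptions made for illustration, not taken from the kernel source.

```c
#include <sys/types.h>

enum model_atomic {
	ATOMIC_TAKE,		/* move the complete packet */
	ATOMIC_EMSGSIZE,	/* bigger than the drain's buffer: abort */
	ATOMIC_LEAVE,		/* no room right now: process later */
	ATOMIC_SHORT		/* maximum lies within the packet: dissolve */
};

/*
 * Hypothetical helper: decide what happens to one packet of pktlen
 * bytes. max == 0 means no maximum; done is the count spliced so far.
 */
enum model_atomic
model_atomic_decide(off_t pktlen, off_t sndbuf_size, off_t sndbuf_space,
    off_t max, off_t done)
{
	if (pktlen > sndbuf_size)
		return ATOMIC_EMSGSIZE;
	if (pktlen > sndbuf_space)
		return ATOMIC_LEAVE;
	if (max != 0 && pktlen > max - done)
		return ATOMIC_SHORT;
	return ATOMIC_TAKE;
}
```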
If the maximum splice length has been reached, an mbuf may get split for
non-atomic protocols. Otherwise an mbuf is either moved completely to the
send buffer or left in the receive buffer for later processing. If
SO_OOBINLINE is set, out-of-band data will get moved as such although this
might not be reliable. The data is sent out to the drain socket via the
protocol function. If that fails and the drain socket cannot send anymore, an
EPIPE error is set on the source socket.
For packet-oriented protocols, somove() iterates over the next packet queue.
If a maximum splice length was specified and at least this amount of data has
been received from the drain socket, splicing gets dissolved. In this case,
an EFBIG error is set on the source socket if the maximum amount of data has
been transferred. Userland can process this error to distinguish the full
splice from a short splice or to react to the completed maximum splice
immediately. If an idle timeout was specified and no data has been
transferred for that period of time, the handler soidle() dissolves splicing
and sets an ETIMEDOUT error on the source socket.
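A userland program might map the error found on the source socket to the outcomes described above. This is a minimal sketch: the enum and the function classify_splice_error are hypothetical names, and treating every other non-zero error as a plain failure is an assumption.

```c
#include <errno.h>

enum splice_end {
	SPLICE_NO_ERROR,	/* short splice or end of data, no error set */
	SPLICE_FULL,		/* EFBIG: the maximum has been transferred */
	SPLICE_IDLE,		/* ETIMEDOUT: the idle timeout has fired */
	SPLICE_FAILED		/* any other socket error */
};

/* Hypothetical helper: classify the error value of the source socket. */
enum splice_end
classify_splice_error(int soerr)
{
	switch (soerr) {
	case 0:
		return SPLICE_NO_ERROR;
	case EFBIG:
		return SPLICE_FULL;
	case ETIMEDOUT:
		return SPLICE_IDLE;
	default:
		return SPLICE_FAILED;
	}
}
```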
The function sounsplice() is called to dissolve the socket splicing if the
source socket cannot receive anymore and its receive buffer is empty; or if
the drain socket cannot send anymore; or if the maximum has been reached; or
if an error occurred; or if the idle timeout has fired.
If the socket buffer flag SB_SPLICE is set, the functions sorwakeup() and
sowwakeup() will call somove() to trigger the transfer when new data or
buffer space is available. While socket splicing is active, any read(2) from
the source socket will block. Neither read nor write wakeups will be
delivered to the file descriptors. After dissolving, a read event or a socket
error is signaled to userland on the source socket. If space is available, a
write event will be signaled on the drain socket.
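Because the end of splicing is signaled as a read event or socket error on the source socket, userland can wait for it with poll(2) and then fetch the pending error with getsockopt(2) and SO_ERROR. The following is a sketch only: the helper name wait_splice_end and its return convention are hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <poll.h>

/*
 * Hypothetical helper: wait up to timeout_ms for the source socket to
 * signal a read event or an error, then fetch the pending socket error
 * into *soerr. Returns 1 if an event arrived, 0 on timeout and -1 if
 * poll(2) or getsockopt(2) failed.
 */
int
wait_splice_end(int src, int timeout_ms, int *soerr)
{
	struct pollfd pfd;
	socklen_t len = sizeof(*soerr);
	int n;

	pfd.fd = src;
	pfd.events = POLLIN;
	pfd.revents = 0;
	n = poll(&pfd, 1, timeout_ms);
	if (n <= 0)
		return n;	/* 0: timed out, -1: poll(2) failed */
	if (getsockopt(src, SOL_SOCKET, SO_ERROR, soerr, &len) == -1)
		return -1;
	return 1;
}
```

The value stored in *soerr is 0 when no error is pending; EFBIG and ETIMEDOUT indicate a completed maximum splice and an idle timeout respectively.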
RETURN VALUES
sosplice() returns 0 on success and otherwise the error number. somove()
returns 0 if socket splicing has been finished and 1 if it continues.
ERRORS
sosplice() will succeed unless:
- [EBADF] The given file descriptor fd is not an active descriptor.
- [EBUSY] The source or the drain socket is already spliced.
- [EINVAL] The given maximum value max is negative.
- [ENOTCONN] The source socket requires a connection and is neither connected nor in the process of connecting to a peer.
- [ENOTCONN] The drain socket is neither connected nor in the process of connecting to a peer.
- [ENOTSOCK] The given file descriptor fd is not a socket.
- [EOPNOTSUPP] The source or the drain socket is a listen socket.
- [EPROTONOSUPPORT] The source socket's protocol layer does not have the PR_SPLICE flag set. Only TCP and UDP socket splicing is supported.
- [EPROTONOSUPPORT] The drain socket's protocol does not have the same pr_usrreq function as the source.
- [EWOULDBLOCK] The source socket is non-blocking and the receive buffer is already locked.
SEE ALSO
HISTORY
Socket splicing for TCP first appeared in OpenBSD 4.9; support for UDP was added in OpenBSD 5.3.
AUTHORS
The idea for socket splicing originally came from Markus Friedl <markus@openbsd.org>, and Alexander Bluhm <bluhm@openbsd.org> implemented it. Mike Belopuhov <mikeb@openbsd.org> added the timeout feature.