Every networked application you've ever used—from web browsers to databases, from chat applications to distributed systems—communicates through a surprisingly elegant abstraction: the socket. When you open a TCP connection, send a UDP datagram, or establish a Unix domain socket for inter-process communication, you're interacting with one of the most successful API designs in computing history.
The Berkeley Sockets API, born in 4.2BSD in 1983, has become the de facto standard for network programming across virtually every operating system. Linux's implementation is not merely a faithful reproduction—it's a sophisticated, highly optimized subsystem that must handle everything from simple client-server applications to the multi-million-connection workloads of hyperscale data centers.
Understanding the socket layer isn't just about learning an API. It's about comprehending the architectural bridge between user-space applications and the kernel's protocol implementations—a bridge that determines performance characteristics, security boundaries, and the fundamental capabilities available to networked software.
By the end of this page, you will understand the Linux socket layer architecture, including the struct socket and struct sock data structures, the socket system call interface, address family abstraction, and how the socket layer integrates with protocol-specific implementations. You'll gain insight into the design decisions that make Linux networking both flexible and performant.
The Linux socket layer is built on a layered abstraction that cleanly separates user-space concerns from kernel-space implementation details. This separation enables protocol independence at the API level while allowing specialized optimizations at the protocol level.
The fundamental design philosophy:
The socket layer applies object-oriented principles despite being written in C. Each socket is an object with encapsulated state (its connection status and buffers), a table of operations (function pointers that play the role of virtual methods), and protocol-private data (the attached struct sock).
This design allows applications to use the same system calls regardless of whether they're communicating over TCP/IP, UDP, Unix domain sockets, or exotic protocols like Bluetooth or CAN bus.
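To make this concrete, here is a minimal user-space sketch (a hypothetical example program, not from the kernel sources) showing the same socket() call creating three very different transports:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    /* One API, three very different transports. */
    int tcp_fd  = socket(AF_INET,  SOCK_STREAM, 0);  /* TCP over IPv4 */
    int udp_fd  = socket(AF_INET6, SOCK_DGRAM,  0);  /* UDP over IPv6 */
    int unix_fd = socket(AF_UNIX,  SOCK_STREAM, 0);  /* local IPC */

    if (tcp_fd < 0 || udp_fd < 0 || unix_fd < 0) {
        perror("socket");
        return 1;
    }

    /* From here on, bind()/connect()/send()/recv() are called identically;
     * only the sockaddr structure passed in differs per family. */
    close(tcp_fd);
    close(udp_fd);
    close(unix_fd);
    return 0;
}
```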
| Layer | Representation | Purpose | Key Data Structures |
|---|---|---|---|
| User Space | File Descriptor (int) | Application handle for I/O operations | fd, FILE* |
| VFS Layer | struct file | Unified file abstraction | f_op, private_data |
| Socket Layer | struct socket | Protocol-independent socket operations | ops, sk, type, state |
| Protocol Layer | struct sock | Protocol-specific implementation | sk_prot, sk_receive_queue |
| Network Layer | struct sk_buff | Packet buffer representation | data, len, protocol headers |
Understanding the two-structure design:
Linux employs a deliberate two-structure approach for sockets:
struct socket: The BSD-compatible abstraction that provides the user-facing interface. This structure is relatively small and contains generic socket state.
struct sock: The network-layer representation that holds protocol-specific state and data queues. This structure is much larger and contains the actual networking machinery.
This separation isn't merely organizational—it's architectural. The struct socket can exist without a corresponding struct sock during certain lifecycle phases, and multiple struct file objects can reference the same struct socket (after dup() system calls).
```c
/**
 * struct socket - The user-visible socket structure
 *
 * This structure represents the user-space view of a socket.
 * It contains the minimum state needed for file operations
 * and references the protocol-specific struct sock.
 */
struct socket {
    socket_state            state;  /* Socket state (SS_*) */
    short                   type;   /* SOCK_STREAM, SOCK_DGRAM, etc. */
    unsigned long           flags;  /* Socket flags (SOCK_NOSPACE, etc.) */
    struct file             *file;  /* Back pointer to file structure */
    struct sock             *sk;    /* Protocol-specific socket structure */
    const struct proto_ops  *ops;   /* Protocol operations table */
    struct socket_wq        wq;     /* Wait queue for async notifications */
};

/* Socket states */
typedef enum {
    SS_FREE = 0,        /* Not allocated */
    SS_UNCONNECTED,     /* Unconnected to any peer */
    SS_CONNECTING,      /* In process of connecting */
    SS_CONNECTED,       /* Connected to peer */
    SS_DISCONNECTING    /* In process of disconnecting */
} socket_state;
```

The dual-structure design originated from the need to support multiple protocol families with a single user API. The struct socket provides the BSD socket semantics that applications expect, while struct sock contains the implementation details that vary dramatically between TCP, UDP, SCTP, Unix sockets, and other protocols. This separation of concerns enables code reuse and simplifies protocol development.
While struct socket is the user-facing abstraction, struct sock is where the real networking magic happens. This structure—often called the "sock" or "network socket"—contains the complete state of a network connection and is one of the most complex structures in the Linux kernel.
The structure's organization reflects its responsibilities:
```c
/**
 * struct sock - Network layer representation of sockets
 *
 * This is the protocol-agnostic base structure that all
 * protocol-specific socket structures embed or extend.
 */
struct sock {
    /*
     * Cache line organization is critical for performance.
     * Frequently accessed fields are grouped together.
     */

    /* First cache line: Hot path fields */
    struct sock_common  __sk_common;    /* Shared with inet_timewait_sock */

    /* Receive queue - incoming packets waiting for recv() */
    struct sk_buff_head sk_receive_queue;

    /* Write queue - packets scheduled for transmission */
    struct sk_buff_head sk_write_queue;

    /* Error queue - ICMP errors, timestamps, etc. */
    struct sk_buff_head sk_error_queue;

    /* Backlog queue - packets received during user processing */
    struct sk_buff_head sk_backlog;

    /* Memory accounting */
    atomic_t            sk_rmem_alloc;  /* Receive buffer usage */
    atomic_t            sk_wmem_alloc;  /* Write buffer usage */
    atomic_t            sk_omem_alloc;  /* Optional memory usage */

    /* Buffer limits (set via setsockopt) */
    int                 sk_sndbuf;      /* Send buffer size */
    int                 sk_rcvbuf;      /* Receive buffer size */

    /* Socket flags and options */
    unsigned long       sk_flags;       /* SO_KEEPALIVE, etc. */
    unsigned int        sk_shutdown;    /* RCV/SEND shutdown flags */

    /* Protocol operations */
    struct proto        *sk_prot;          /* Protocol callbacks */
    struct proto        *sk_prot_creator;  /* Creator protocol */

    /* Wait queue for blocking operations */
    struct socket_wq __rcu *sk_wq;

    /* Timer for various purposes (retransmit, keepalive) */
    struct timer_list   sk_timer;

    /* Timestamps */
    ktime_t             sk_stamp;       /* Last packet timestamp */

    /* ... many more fields ... */
};

/* Embedded common structure for hash table lookups */
struct sock_common {
    union {
        __addrpair  skc_addrpair;       /* Foreign/local IPv4 addresses */
        struct {
            __be32  skc_daddr;          /* Foreign IPv4 address */
            __be32  skc_rcv_saddr;      /* Bound local IPv4 address */
        };
    };
    union {
        __portpair  skc_portpair;       /* Foreign/local ports */
        struct {
            __be16  skc_dport;          /* Foreign port */
            __u16   skc_num;            /* Local port */
        };
    };
    unsigned short          skc_family;       /* Address family (AF_INET, etc.) */
    volatile unsigned char  skc_state;        /* Connection state */
    unsigned char           skc_reuse;        /* SO_REUSEADDR, SO_REUSEPORT */
    int                     skc_bound_dev_if; /* Bound device index */
};
```

The struct sock is carefully organized with cache line efficiency in mind. Fields accessed together in the fast path (connection lookup, packet reception) are placed in the same cache line. This matters enormously—a cache miss costs ~100 cycles on modern CPUs, and high-performance networking code can be dominated by memory access patterns rather than computation.
Protocol-specific socket extensions:
Different protocols extend the base struct sock with their own fields. This is accomplished through C structure embedding, where the generic sock is placed at the beginning of protocol-specific structures:
```c
/* TCP socket - extends sock with TCP-specific state */
struct tcp_sock {
    struct inet_connection_sock inet_conn;  /* Contains struct sock */

    /* TCP-specific fields */
    u32 snd_una;    /* First unacknowledged byte */
    u32 snd_nxt;    /* Next byte to send */
    u32 rcv_nxt;    /* Expected next receive sequence */
    u32 snd_wnd;    /* Send window size */
    u32 rcv_wnd;    /* Receive window size */

    /* ... hundreds more TCP-specific fields ... */
};
```
This embedding pattern enables polymorphic behavior: code handling generic sockets can access the base struct sock fields, while protocol-specific code can cast to the extended structure when needed.
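In the kernel source these casts are wrapped in small inline accessors. A simplified rendering of the TCP accessor (the real definition lives in include/linux/tcp.h):

```c
/* Valid because struct sock sits at offset 0 of the embedding chain:
 * tcp_sock -> inet_connection_sock -> inet_sock -> sock. */
static inline struct tcp_sock *tcp_sk(const struct sock *sk)
{
    return (struct tcp_sock *)sk;
}
```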
The socket API exposes network functionality through a carefully designed set of system calls. These calls have remained remarkably stable since their introduction in 4.2BSD—a testament to the quality of the original design.
The core socket system calls form two groups: lifecycle and connection management (socket(), bind(), listen(), accept(), connect(), close()) and data transfer (send(), sendto(), recv(), recvfrom(), and their msghdr-based variants).
Each system call traverses a well-defined path through the kernel, from the syscall entry point through VFS, socket layer, and finally to the protocol implementation.
| System Call | Kernel Function | Purpose | Key Operations |
|---|---|---|---|
| socket() | __sys_socket() | Create new socket | Allocate struct socket, call protocol create() |
| bind() | __sys_bind() | Assign local address | Validate address, call protocol bind() |
| listen() | __sys_listen() | Mark as passive socket | Allocate accept queue, call protocol listen() |
| accept() | __sys_accept4() | Accept connection | Dequeue from accept queue, create new socket |
| connect() | __sys_connect() | Active connection open | Initiate handshake, call protocol connect() |
| send()/sendto() | __sys_sendto() | Transmit data | Copy to kernel, call protocol sendmsg() |
| recv()/recvfrom() | __sys_recvfrom() | Receive data | Dequeue from receive queue, copy to user |
| close() | sock_close() | Release socket | Teardown connection, free resources |
```c
/**
 * __sys_socket - Create a new socket
 * @family: Protocol family (AF_INET, AF_UNIX, etc.)
 * @type: Socket type (SOCK_STREAM, SOCK_DGRAM, etc.)
 * @protocol: Protocol number (usually 0 for default)
 *
 * This is the kernel implementation of the socket() system call.
 * It creates a new socket structure and associates it with a file descriptor.
 */
int __sys_socket(int family, int type, int protocol)
{
    struct socket *sock;
    int flags, retval;

    /* Extract type flags (SOCK_NONBLOCK, SOCK_CLOEXEC) */
    flags = type & ~SOCK_TYPE_MASK;
    type &= SOCK_TYPE_MASK;

    /* Validate socket type */
    if (type < 0 || type >= SOCK_MAX)
        return -EINVAL;

    /* Create the socket structure */
    retval = sock_create(family, type, protocol, &sock);
    if (retval < 0)
        return retval;

    /*
     * Map the socket to a file descriptor.
     * This creates struct file and allocates an fd.
     */
    return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}

/**
 * sock_create - Allocate and initialize a socket
 *
 * This function finds the appropriate protocol family handler
 * and calls its create() method to initialize the socket.
 */
int sock_create(int family, int type, int protocol, struct socket **res)
{
    struct socket *sock;
    const struct net_proto_family *pf;
    int err;

    /* Allocate socket structure */
    sock = sock_alloc();
    if (!sock)
        return -ENFILE;

    sock->type = type;

    /* Find and call protocol family handler */
    pf = rcu_dereference(net_families[family]);
    if (!pf || !pf->create) {
        err = -EAFNOSUPPORT;
        goto out_release;
    }

    /*
     * Call protocol-specific create function.
     * For AF_INET/SOCK_STREAM, this eventually calls tcp_v4_init_sock().
     */
    err = pf->create(current->nsproxy->net_ns, sock, protocol, 0);
    if (err < 0)
        goto out_release;

    *res = sock;
    return 0;

out_release:
    sock_release(sock);
    return err;
}
```

The flow of a connect() call:
To understand how system calls traverse the socket layer, let's trace a connect() call on a TCP socket:
1. The application calls connect(fd, addr, addrlen)
2. The syscall enters the kernel at __sys_connect()
3. sockfd_lookup_light() finds the struct socket from the fd
4. move_addr_to_kernel() copies the sockaddr from user space
5. sock->ops->connect() is called
6. For TCP, this dispatches inet_stream_connect() → tcp_v4_connect() → tcp_connect()

This layered dispatch pattern is the key to socket layer extensibility—new protocols can be added by registering handlers without modifying core socket code.
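The user-space side of that trace is a single call. A minimal sketch (hypothetical example; the address is a placeholder):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(80);                    /* network byte order */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr); /* placeholder address */

    /* This one call drives the entire kernel path traced above,
     * ending in tcp_connect() sending the SYN. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("connect");

    close(fd);
    return 0;
}
```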
One of the most elegant aspects of the socket API is its address family abstraction. The same connect(), bind(), and sendto() calls work identically whether you're using IPv4, IPv6, Unix domain sockets, Bluetooth, or any other supported protocol—the difference lies only in the address structure passed.
Address family registration:
Each address family registers a struct net_proto_family with the kernel, providing a create() function that initializes sockets for that family. The global net_families[] array maps family numbers (AF_INET = 2, AF_UNIX = 1, etc.) to their handlers.
```c
/**
 * struct net_proto_family - Protocol family definition
 *
 * Each address family (AF_INET, AF_UNIX, etc.) provides
 * one of these structures to define its socket creation handler.
 */
struct net_proto_family {
    int             family;     /* Address family number */
    int             (*create)(struct net *net, struct socket *sock,
                              int protocol, int kern);
    struct module   *owner;     /* Module providing this family */
};

/* Example: IPv4 family registration */
static const struct net_proto_family inet_family_ops = {
    .family = AF_INET,
    .create = inet_create,  /* Called for socket(AF_INET, ...) */
    .owner  = THIS_MODULE,
};

/* Example: Unix domain socket family */
static const struct net_proto_family unix_family_ops = {
    .family = AF_UNIX,
    .create = unix_create,
    .owner  = THIS_MODULE,
};

/* Registration during module init */
static int __init inet_init(void)
{
    /* ... initialization ... */
    (void)sock_register(&inet_family_ops);
    /* ... more initialization ... */
}

/**
 * inet_create - Create an INET family socket
 *
 * This function handles socket(AF_INET, type, protocol) calls.
 * It determines the correct protocol handler (TCP, UDP, RAW, etc.)
 * and initializes the socket accordingly.
 */
static int inet_create(struct net *net, struct socket *sock, int protocol,
                       int kern)
{
    struct inet_protosw *answer;
    struct proto *answer_prot;
    struct sock *sk;
    int err;

    /* Find protocol switch entry for this type/protocol combination */
    list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {
        if (protocol == answer->protocol || protocol == IPPROTO_IP) {
            /* Found matching protocol handler */
            break;
        }
    }

    /* Setup socket operations table */
    sock->ops   = answer->ops;   /* proto_ops for this type */
    answer_prot = answer->prot;  /* struct proto for this protocol */

    /* Allocate struct sock (protocol-specific) */
    sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);
    if (!sk)
        return -ENOBUFS;

    /* Initialize the socket */
    sock_init_data(sock, sk);

    /* Call protocol-specific init (tcp_v4_init_sock, udp_init_sock, etc.) */
    if (sk->sk_prot->init) {
        err = sk->sk_prot->init(sk);
        if (err)
            goto out_free;
    }

    return 0;

out_free:
    sk_free(sk);
    return err;
}
```

The sockaddr abstraction:
Address families use different address structures, but all share a common header that allows the kernel to determine the family before interpreting the rest:
```c
/* Generic socket address (minimum required) */
struct sockaddr {
    sa_family_t sa_family;      /* Address family */
    char        sa_data[14];    /* Protocol-specific address */
};

/* IPv4 address */
struct sockaddr_in {
    sa_family_t     sin_family; /* AF_INET */
    __be16          sin_port;   /* Port number */
    struct in_addr  sin_addr;   /* IPv4 address */
    unsigned char   __pad[8];   /* Padding to sockaddr size */
};

/* IPv6 address */
struct sockaddr_in6 {
    sa_family_t     sin6_family;   /* AF_INET6 */
    __be16          sin6_port;     /* Port number */
    __be32          sin6_flowinfo; /* IPv6 flow info */
    struct in6_addr sin6_addr;     /* IPv6 address */
    __u32           sin6_scope_id; /* Scope ID */
};

/* Unix domain socket address */
struct sockaddr_un {
    sa_family_t sun_family;     /* AF_UNIX */
    char        sun_path[108];  /* Pathname */
};
```
The kernel checks sa_family and then casts to the appropriate structure for that family. This simple pattern enables the single API to support dramatically different addressing schemes.
Applications that need to handle multiple address families without knowing the type at compile time use struct sockaddr_storage—a structure large enough to hold any address type with proper alignment. This is essential for protocol-agnostic code like accept() handlers that must store client addresses of unknown family.
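A minimal sketch of that pattern (hypothetical helper function):

```c
#include <stdio.h>
#include <sys/socket.h>

/* Protocol-agnostic accept(): works for IPv4, IPv6, and Unix sockets. */
int accept_any(int listen_fd)
{
    struct sockaddr_storage peer;   /* large enough for any family */
    socklen_t peer_len = sizeof(peer);

    int fd = accept(listen_fd, (struct sockaddr *)&peer, &peer_len);
    if (fd < 0)
        return -1;

    /* Only now do we inspect the family and cast appropriately. */
    switch (peer.ss_family) {
    case AF_INET:  printf("IPv4 client\n"); break;
    case AF_INET6: printf("IPv6 client\n"); break;
    case AF_UNIX:  printf("local client\n"); break;
    }
    return fd;
}
```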
The socket layer achieves protocol independence through operation tables—structures filled with function pointers that implement protocol-specific behavior. This is the C equivalent of virtual method tables in object-oriented languages.
Two levels of operations:
struct proto_ops: User-facing operations attached to struct socket. These correspond directly to socket system calls and perform validation, state checks, and user/kernel data copying before delegating to protocol-specific code.
struct proto: Network layer operations attached to struct sock. These implement the core protocol logic—connection establishment, data transmission, congestion control, etc.
```c
/**
 * struct proto_ops - Socket operation table
 *
 * These operations are attached to struct socket and implement
 * the BSD socket API. They handle user-kernel interface concerns.
 */
struct proto_ops {
    int             family;
    struct module   *owner;

    /* Socket lifecycle */
    int     (*release)(struct socket *sock);
    int     (*bind)(struct socket *sock, struct sockaddr *addr,
                    int addrlen);

    /* Connection management */
    int     (*connect)(struct socket *sock, struct sockaddr *addr,
                       int addrlen, int flags);
    int     (*accept)(struct socket *sock, struct socket *newsock,
                      int flags);
    int     (*listen)(struct socket *sock, int backlog);

    /* Name/Address operations */
    int     (*getname)(struct socket *sock, struct sockaddr *addr,
                       int peer);

    /* Data transfer */
    int     (*sendmsg)(struct socket *sock, struct msghdr *msg,
                       size_t len);
    int     (*recvmsg)(struct socket *sock, struct msghdr *msg,
                       size_t len, int flags);

    /* Options */
    int     (*setsockopt)(struct socket *sock, int level, int optname,
                          sockptr_t optval, unsigned int optlen);
    int     (*getsockopt)(struct socket *sock, int level, int optname,
                          char __user *optval, int __user *optlen);

    /* Event notification */
    __poll_t (*poll)(struct file *file, struct socket *sock,
                     poll_table *wait);

    /* ... additional operations ... */
};

/* TCP socket operations (for SOCK_STREAM over AF_INET) */
const struct proto_ops inet_stream_ops = {
    .family     = PF_INET,
    .owner      = THIS_MODULE,
    .release    = inet_release,
    .bind       = inet_bind,
    .connect    = inet_stream_connect,
    .accept     = inet_accept,
    .listen     = inet_listen,
    .getname    = inet_getname,
    .sendmsg    = inet_sendmsg,
    .recvmsg    = inet_recvmsg,
    .poll       = tcp_poll,
    .setsockopt = sock_common_setsockopt,
    .getsockopt = sock_common_getsockopt,
    /* ... */
};

/**
 * struct proto - Protocol handler operations
 *
 * These operations implement core protocol logic and are attached
 * to struct sock. They handle the actual networking work.
 */
struct proto {
    char            name[32];
    struct module   *owner;

    /* Socket lifecycle */
    int     (*init)(struct sock *sk);
    void    (*destroy)(struct sock *sk);
    void    (*close)(struct sock *sk, long timeout);

    /* Connection management */
    int     (*connect)(struct sock *sk, struct sockaddr *addr,
                       int addrlen);
    int     (*disconnect)(struct sock *sk, int flags);
    struct sock *(*accept)(struct sock *sk, int flags, int *err);

    /* Data transfer */
    int     (*sendmsg)(struct sock *sk, struct msghdr *msg, size_t len);
    int     (*recvmsg)(struct sock *sk, struct msghdr *msg, size_t len,
                       int noblock, int flags, int *addr_len);

    /* Backlog processing (packets deferred while the user holds the lock) */
    int     (*backlog_rcv)(struct sock *sk, struct sk_buff *skb);

    /* Memory management */
    atomic_t    memory_allocated;   /* Protocol memory usage */
    int         memory_pressure;    /* Memory pressure indicator */

    /* Sysctl tunables */
    int     *sysctl_wmem;   /* Write buffer sysctl */
    int     *sysctl_rmem;   /* Read buffer sysctl */

    /* ... many more operations ... */
};

/* TCP protocol handler */
struct proto tcp_prot = {
    .name       = "TCP",
    .owner      = THIS_MODULE,
    .init       = tcp_v4_init_sock,
    .close      = tcp_close,
    .connect    = tcp_v4_connect,
    .disconnect = tcp_disconnect,
    .accept     = inet_csk_accept,
    .sendmsg    = tcp_sendmsg,
    .recvmsg    = tcp_recvmsg,
    /* ... */
};
```

The dispatch flow:
When an application calls send() on a TCP socket, the execution path is:
1. __sys_sendto() — System call entry
2. sock_sendmsg() — Generic socket layer
3. sock->ops->sendmsg() → inet_sendmsg() — INET layer
4. sk->sk_prot->sendmsg() → tcp_sendmsg() — TCP implementation

This chain of indirection costs some CPU cycles, but it enables the clean separation that makes Linux networking so maintainable and extensible.
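The generic layer's contribution to this chain is deliberately thin. Roughly (a condensed sketch of sock_sendmsg() from net/socket.c, with helper indirection omitted):

```c
/* Generic socket layer: run the LSM hook, then dispatch to the
 * family's ops table. */
int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
    int err = security_socket_sendmsg(sock, msg, msg_data_left(msg));

    if (err)
        return err;
    return sock->ops->sendmsg(sock, msg, msg_data_left(msg));
}
```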
For performance-critical paths, the kernel sometimes bypasses the generic dispatch. For example, TCP fast paths may skip intermediate layers when conditions are favorable. The kernel also uses static keys (runtime-patchable jump instructions) to eliminate branches for common configurations.
Network data flows through the kernel in discrete chunks called socket buffers (struct sk_buff or "skb"). These structures are the workhorses of Linux networking—every packet received or transmitted is represented by an skb. The socket layer manages multiple queues of these buffers to coordinate between application I/O and network activity.
The key queues in struct sock: sk_receive_queue (incoming packets waiting for recv()), sk_write_queue (data scheduled for transmission), sk_error_queue (ICMP errors, timestamps), and sk_backlog (packets that arrive while the application holds the socket lock).
```c
/**
 * sk_buff_head - Queue of socket buffers
 *
 * This doubly-linked list structure manages packet queues.
 * It includes a spinlock for concurrent access protection.
 */
struct sk_buff_head {
    struct sk_buff  *next;
    struct sk_buff  *prev;
    __u32           qlen;   /* Number of buffers in queue */
    spinlock_t      lock;   /* Queue lock */
};

/* Add packet to end of receive queue */
void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *skb)
{
    unsigned long flags;

    spin_lock_irqsave(&list->lock, flags);
    __skb_queue_tail(list, skb);
    spin_unlock_irqrestore(&list->lock, flags);
}

/* Remove and return first packet from receive queue */
struct sk_buff *skb_dequeue(struct sk_buff_head *list)
{
    struct sk_buff *skb;
    unsigned long flags;

    spin_lock_irqsave(&list->lock, flags);
    skb = __skb_dequeue(list);
    spin_unlock_irqrestore(&list->lock, flags);
    return skb;
}

/**
 * Memory accounting for socket queues
 *
 * The kernel tracks memory used by each socket to enforce
 * buffer limits (SO_RCVBUF, SO_SNDBUF) and system-wide limits.
 */

/* Charge memory to receive buffer (simplified) */
int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, int size)
{
    /* Check if socket receive buffer has room */
    if (!sk_has_account(sk))
        return 1;
    return size <= sk->sk_rcvbuf - atomic_read(&sk->sk_rmem_alloc) ||
           __sk_mem_schedule(sk, size, SK_MEM_RECV);
}

/* Called when skb is queued to receive queue */
void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
    skb->sk = sk;
    skb->destructor = sock_rfree;
    atomic_add(skb->truesize, &sk->sk_rmem_alloc);
}

/*
 * Called when the skb is freed (data consumed by the application).
 * The send-side counterpart, sock_wfree(), additionally calls
 * sk->sk_write_space() to wake writers blocked on a full send buffer.
 */
void sock_rfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;

    atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
}
```

The backlog queue and locking:
Socket processing faces a threading challenge: packets can arrive (in softirq context) while the application is actively reading (in process context). The sk_backlog queue solves this elegantly:
1. The application calls recv() and acquires the socket lock (sk->sk_lock.slock)
2. Packets arriving in softirq context while the lock is held are appended to sk_backlog instead of being processed immediately
3. When the application releases the lock, the queued backlog packets are processed in process context

The backlog mechanism is crucial for performance—without it, packet processing would block on application activity, or vice versa.
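A simplified sketch of the drain step, consistent with the simplified struct sock shown earlier (the real logic lives in __release_sock() in net/core/sock.c and is considerably more involved):

```c
/* Drain the backlog when the application drops the socket lock. */
void release_sock_sketch(struct sock *sk)
{
    struct sk_buff *skb;

    spin_lock_bh(&sk->sk_lock.slock);

    /* Process packets that arrived while the application held the lock. */
    while ((skb = __skb_dequeue(&sk->sk_backlog)) != NULL)
        sk->sk_backlog_rcv(sk, skb);    /* e.g. tcp_v4_do_rcv() for TCP */

    sk->sk_lock.owned = 0;
    spin_unlock_bh(&sk->sk_lock.slock);
}
```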
When socket buffers fill up (sk_rmem_alloc exceeds sk_rcvbuf), the kernel drops incoming packets. For TCP, this triggers flow control—the receive window shrinks, slowing the sender. For UDP, packets are silently dropped. This is why proper buffer sizing (via setsockopt or sysctls) is critical for high-throughput applications.
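A quick way to see buffer sizing in action from user space (a minimal sketch; unprivileged requests are clamped to net.core.rmem_max):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 1 << 20;            /* ask for 1 MiB */
    int effective = 0;
    socklen_t len = sizeof(effective);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

    /* Linux reports roughly double the accepted value, reflecting the
     * internal overhead accounting shown later on this page. */
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len);
    printf("requested %d, effective %d\n", requested, effective);

    close(fd);
    return 0;
}
```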
Socket behavior is extensively configurable through the setsockopt() and getsockopt() system calls. Options are organized by level—the layer of the networking stack that implements them:
| Option | Level | Purpose | Impact |
|---|---|---|---|
| SO_REUSEADDR | SOL_SOCKET | Allow address reuse | Enables server restart without TIME_WAIT delay |
| SO_REUSEPORT | SOL_SOCKET | Allow port sharing | Multiple sockets can bind same port (load balancing) |
| SO_RCVBUF | SOL_SOCKET | Receive buffer size | Controls maximum pending data; affects throughput |
| SO_SNDBUF | SOL_SOCKET | Send buffer size | Controls outgoing queue depth; affects throughput |
| SO_KEEPALIVE | SOL_SOCKET | Enable keepalives | Detect dead peers; important for idle connections |
| SO_LINGER | SOL_SOCKET | Linger on close | Controls close() behavior with pending data |
| TCP_NODELAY | IPPROTO_TCP | Disable Nagle | Reduces latency for small writes (interactive apps) |
| TCP_CORK | IPPROTO_TCP | Cork output | Batches small writes; opposite of NODELAY |
| TCP_QUICKACK | IPPROTO_TCP | Disable delayed ACK | Send ACKs immediately; reduces latency |
| IP_TOS | IPPROTO_IP | Type of Service | Sets DSCP/ECN bits for QoS |
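As a user-space illustration of the level concept, here is a minimal sketch (hypothetical helper) that disables Nagle's algorithm on an existing TCP socket:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle for a latency-sensitive connection. */
int enable_low_latency(int fd)
{
    int one = 1;

    /* Note the level: TCP_NODELAY lives at IPPROTO_TCP, not SOL_SOCKET. */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```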
```c
/**
 * sock_common_setsockopt - Generic setsockopt implementation
 *
 * This function dispatches to the appropriate handler based
 * on the option level (SOL_SOCKET vs protocol-specific).
 */
int sock_common_setsockopt(struct socket *sock, int level, int optname,
                           sockptr_t optval, unsigned int optlen)
{
    struct sock *sk = sock->sk;

    /* SOL_SOCKET options are handled generically */
    if (level == SOL_SOCKET)
        return sock_setsockopt(sock, level, optname, optval, optlen);

    /* Delegate to protocol-specific handler */
    return sk->sk_prot->setsockopt(sk, level, optname, optval, optlen);
}

/**
 * sock_setsockopt - Handle SOL_SOCKET level options
 */
int sock_setsockopt(struct socket *sock, int level, int optname,
                    sockptr_t optval, unsigned int optlen)
{
    struct sock *sk = sock->sk;
    int val;
    int ret = 0;

    if (copy_from_sockptr(&val, optval, sizeof(val)))
        return -EFAULT;

    lock_sock(sk);

    switch (optname) {
    case SO_REUSEADDR:
        sk->sk_reuse = (val ? SK_CAN_REUSE : SK_NO_REUSE);
        break;

    case SO_REUSEPORT:
        sk->sk_reuseport = val ? 1 : 0;
        break;

    case SO_RCVBUF:
        /* Clamp to system limits */
        val = min_t(u32, val, sysctl_rmem_max);
        val = min_t(int, val, INT_MAX / 2);
        sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
        /* Kernel doubles the value (internal overhead accounting) */
        WRITE_ONCE(sk->sk_rcvbuf,
                   max_t(int, val * 2, SOCK_MIN_RCVBUF));
        break;

    case SO_SNDBUF:
        val = min_t(u32, val, sysctl_wmem_max);
        val = min_t(int, val, INT_MAX / 2);
        sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
        WRITE_ONCE(sk->sk_sndbuf,
                   max_t(int, val * 2, SOCK_MIN_SNDBUF));
        break;

    case SO_KEEPALIVE:
        if (sk->sk_prot->keepalive)
            sk->sk_prot->keepalive(sk, val);
        sock_valbool_flag(sk, SOCK_KEEPOPEN, val);
        break;

    case SO_LINGER:
        /* Handle struct linger instead of int */
        /* ... special handling ... */
        break;

    /* ... many more options ... */

    default:
        ret = -ENOPROTOOPT;
        break;
    }

    release_sock(sk);
    return ret;
}
```

For high-bandwidth applications, socket buffers should be sized to handle the bandwidth-delay product (BDP = bandwidth × RTT). For a 10 Gbps link with 50 ms RTT, the BDP is ~62.5 MB. The kernel's autotuning (net.ipv4.tcp_moderate_rcvbuf) handles this for most cases, but manual tuning may be needed for extreme workloads.
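A sketch of BDP-based sizing from user space (hypothetical helper; the request is still subject to net.core.rmem_max clamping):

```c
#include <sys/socket.h>

/* Size the receive buffer to the bandwidth-delay product. */
int size_rcvbuf_for_bdp(int fd, long long bits_per_sec, double rtt_sec)
{
    /* e.g. 10e9 b/s * 0.050 s / 8 = 62,500,000 bytes (~62.5 MB) */
    int bdp_bytes = (int)(bits_per_sec * rtt_sec / 8.0);

    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                      &bdp_bytes, sizeof(bdp_bytes));
}
```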
The Linux socket layer is a masterfully designed abstraction that has scaled from simple client-server applications to handling millions of concurrent connections in modern cloud infrastructure. Understanding its architecture is essential for anyone building high-performance networked systems.
What's next:
With the socket layer understood, we'll dive deeper into the Linux networking stack. The next page explores the protocol stack architecture—how layers from socket to device driver are organized, how packets flow between layers, and the critical structures and functions that implement TCP/IP networking in Linux.
You now understand the Linux socket layer architecture—the bridge between application network I/O and the kernel's protocol implementations. This foundation is essential for understanding performance characteristics, debugging networking issues, and building systems that can scale to handle massive workloads.