Every networked application you've ever used—from web browsers to databases, from chat applications to distributed systems—communicates through a surprisingly elegant abstraction: the socket. When you open a TCP connection, send a UDP datagram, or establish a Unix domain socket for inter-process communication, you're interacting with one of the most successful API designs in computing history.
The Berkeley Sockets API, born in 4.2BSD in 1983, has become the de facto standard for network programming across virtually every operating system. Linux's implementation is not merely a faithful reproduction—it's a sophisticated, highly optimized subsystem that must handle everything from simple client-server applications to the multi-million-connection workloads of hyperscale data centers.
Understanding the socket layer isn't just about learning an API. It's about comprehending the architectural bridge between user-space applications and the kernel's protocol implementations—a bridge that determines performance characteristics, security boundaries, and the fundamental capabilities available to networked software.
By the end of this page, you will understand the Linux socket layer architecture, including the struct socket and struct sock data structures, the socket system call interface, address family abstraction, and how the socket layer integrates with protocol-specific implementations. You'll gain insight into the design decisions that make Linux networking both flexible and performant.
The Linux socket layer is built on a layered abstraction that cleanly separates user-space concerns from kernel-space implementation details. This separation enables protocol independence at the API level while allowing specialized optimizations at the protocol level.
The fundamental design philosophy:
The socket layer applies object-oriented principles despite being written in C. Each socket is an object with encapsulated state (its connection status and buffers), a table of operations (function pointers that play the role of virtual methods), and protocol-private data (the attached struct sock).
This design allows applications to use the same system calls regardless of whether they're communicating over TCP/IP, UDP, Unix domain sockets, or exotic protocols like Bluetooth or CAN bus.
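To make this concrete, here is a minimal user-space sketch (a hypothetical example program, not from the kernel sources) showing the same socket() call creating three very different transports:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
    /* One API, three very different transports. */
    int tcp_fd  = socket(AF_INET,  SOCK_STREAM, 0);  /* TCP over IPv4 */
    int udp_fd  = socket(AF_INET6, SOCK_DGRAM,  0);  /* UDP over IPv6 */
    int unix_fd = socket(AF_UNIX,  SOCK_STREAM, 0);  /* local IPC */

    if (tcp_fd < 0 || udp_fd < 0 || unix_fd < 0) {
        perror("socket");
        return 1;
    }

    /* From here on, bind()/connect()/send()/recv() are called identically;
     * only the sockaddr structure passed in differs per family. */
    close(tcp_fd);
    close(udp_fd);
    close(unix_fd);
    return 0;
}
```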
| Layer | Representation | Purpose | Key Data Structures |
|---|---|---|---|
| User Space | File Descriptor (int) | Application handle for I/O operations | fd, FILE* |
| VFS Layer | struct file | Unified file abstraction | f_op, private_data |
| Socket Layer | struct socket | Protocol-independent socket operations | ops, sk, type, state |
| Protocol Layer | struct sock | Protocol-specific implementation | sk_prot, sk_receive_queue |
| Network Layer | struct sk_buff | Packet buffer representation | data, len, protocol headers |
Understanding the two-structure design:
Linux employs a deliberate two-structure approach for sockets:
struct socket: The BSD-compatible abstraction that provides the user-facing interface. This structure is relatively small and contains generic socket state.
struct sock: The network-layer representation that holds protocol-specific state and data queues. This structure is much larger and contains the actual networking machinery.
This separation isn't merely organizational—it's architectural. The struct socket can exist without a corresponding struct sock during certain lifecycle phases, and multiple struct file objects can reference the same struct socket (after dup() system calls).
```c
/**
 * struct socket - The user-visible socket structure
 *
 * This structure represents the user-space view of a socket.
 * It contains the minimum state needed for file operations
 * and references the protocol-specific struct sock.
 */
struct socket {
    socket_state            state;  /* Socket state (SS_*) */
    short                   type;   /* SOCK_STREAM, SOCK_DGRAM, etc. */
    unsigned long           flags;  /* Socket flags (SOCK_NOSPACE, etc.) */
    struct file             *file;  /* Back pointer to file structure */
    struct sock             *sk;    /* Protocol-specific socket structure */
    const struct proto_ops  *ops;   /* Protocol operations table */
    struct socket_wq        wq;     /* Wait queue for async notifications */
};

/* Socket states */
typedef enum {
    SS_FREE = 0,        /* Not allocated */
    SS_UNCONNECTED,     /* Unconnected to any peer */
    SS_CONNECTING,      /* In process of connecting */
    SS_CONNECTED,       /* Connected to peer */
    SS_DISCONNECTING    /* In process of disconnecting */
} socket_state;
```

The dual-structure design originated from the need to support multiple protocol families with a single user API. The struct socket provides the BSD socket semantics that applications expect, while struct sock contains the implementation details that vary dramatically between TCP, UDP, SCTP, Unix sockets, and other protocols. This separation of concerns enables code reuse and simplifies protocol development.
While struct socket is the user-facing abstraction, struct sock is where the real networking magic happens. This structure—often called the "sock" or "network socket"—contains the complete state of a network connection and is one of the most complex structures in the Linux kernel.
The structure's organization reflects its responsibilities:
```c
/**
 * struct sock - Network layer representation of sockets
 *
 * This is the protocol-agnostic base structure that all
 * protocol-specific socket structures embed or extend.
 */
struct sock {
    /*
     * Cache line organization is critical for performance.
     * Frequently accessed fields are grouped together.
     */

    /* First cache line: Hot path fields */
    struct sock_common  __sk_common;    /* Shared with inet_timewait_sock */

    /* Receive queue - incoming packets waiting for recv() */
    struct sk_buff_head sk_receive_queue;

    /* Write queue - packets scheduled for transmission */
    struct sk_buff_head sk_write_queue;

    /* Error queue - ICMP errors, timestamps, etc. */
    struct sk_buff_head sk_error_queue;

    /* Backlog queue - packets received during user processing */
    struct sk_buff_head sk_backlog;

    /* Memory accounting */
    atomic_t            sk_rmem_alloc;  /* Receive buffer usage */
    atomic_t            sk_wmem_alloc;  /* Write buffer usage */
    atomic_t            sk_omem_alloc;  /* Optional memory usage */

    /* Buffer limits (set via setsockopt) */
    int                 sk_sndbuf;      /* Send buffer size */
    int                 sk_rcvbuf;      /* Receive buffer size */

    /* Socket flags and options */
    unsigned long       sk_flags;       /* SO_KEEPALIVE, etc. */
    unsigned int        sk_shutdown;    /* RCV/SEND shutdown flags */

    /* Protocol operations */
    struct proto        *sk_prot;          /* Protocol callbacks */
    struct proto        *sk_prot_creator;  /* Creator protocol */

    /* Wait queue for blocking operations */
    struct socket_wq __rcu *sk_wq;

    /* Timer for various purposes (retransmit, keepalive) */
    struct timer_list   sk_timer;

    /* Timestamps */
    ktime_t             sk_stamp;       /* Last packet timestamp */

    /* ... many more fields ... */
};

/* Embedded common structure for hash table lookups */
struct sock_common {
    union {
        __addrpair  skc_addrpair;       /* Foreign/local IPv4 addresses */
        struct {
            __be32  skc_daddr;          /* Foreign IPv4 address */
            __be32  skc_rcv_saddr;      /* Bound local IPv4 address */
        };
    };
    union {
        __portpair  skc_portpair;       /* Foreign/local ports */
        struct {
            __be16  skc_dport;          /* Foreign port */
            __u16   skc_num;            /* Local port */
        };
    };
    unsigned short          skc_family;       /* Address family (AF_INET, etc.) */
    volatile unsigned char  skc_state;        /* Connection state */
    unsigned char           skc_reuse;        /* SO_REUSEADDR, SO_REUSEPORT */
    int                     skc_bound_dev_if; /* Bound device index */
};
```

The struct sock is carefully organized with cache line efficiency in mind. Fields accessed together in the fast path (connection lookup, packet reception) are placed in the same cache line. This matters enormously—a cache miss costs ~100 cycles on modern CPUs, and high-performance networking code can be dominated by memory access patterns rather than computation.
Protocol-specific socket extensions:
Different protocols extend the base struct sock with their own fields. This is accomplished through C structure embedding, where the generic sock is placed at the beginning of protocol-specific structures:
```c
/* TCP socket - extends sock with TCP-specific state */
struct tcp_sock {
    struct inet_connection_sock inet_conn;  /* Contains struct sock */

    /* TCP-specific fields */
    u32 snd_una;    /* First unacknowledged byte */
    u32 snd_nxt;    /* Next byte to send */
    u32 rcv_nxt;    /* Expected next receive sequence */
    u32 snd_wnd;    /* Send window size */
    u32 rcv_wnd;    /* Receive window size */

    /* ... hundreds more TCP-specific fields ... */
};
```
This embedding pattern enables polymorphic behavior: code handling generic sockets can access the base struct sock fields, while protocol-specific code can cast to the extended structure when needed.
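In the kernel source these casts are wrapped in small inline accessors. A simplified rendering of the TCP accessor (the real definition lives in include/linux/tcp.h):

```c
/* Valid because struct sock sits at offset 0 of the embedding chain:
 * tcp_sock -> inet_connection_sock -> inet_sock -> sock. */
static inline struct tcp_sock *tcp_sk(const struct sock *sk)
{
    return (struct tcp_sock *)sk;
}
```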
The socket API exposes network functionality through a carefully designed set of system calls. These calls have remained remarkably stable since their introduction in 4.2BSD—a testament to the quality of the original design.
The core socket system calls form two groups: lifecycle and connection management (socket(), bind(), listen(), accept(), connect(), close()) and data transfer (send(), sendto(), recv(), recvfrom(), and their msghdr-based variants).
Each system call traverses a well-defined path through the kernel, from the syscall entry point through VFS, socket layer, and finally to the protocol implementation.
| System Call | Kernel Function | Purpose | Key Operations |
|---|---|---|---|
| socket() | __sys_socket() | Create new socket | Allocate struct socket, call protocol create() |
| bind() | __sys_bind() | Assign local address | Validate address, call protocol bind() |
| listen() | __sys_listen() | Mark as passive socket | Allocate accept queue, call protocol listen() |
| accept() | __sys_accept4() | Accept connection | Dequeue from accept queue, create new socket |
| connect() | __sys_connect() | Active connection open | Initiate handshake, call protocol connect() |
| send()/sendto() | __sys_sendto() | Transmit data | Copy to kernel, call protocol sendmsg() |
| recv()/recvfrom() | __sys_recvfrom() | Receive data | Dequeue from receive queue, copy to user |
| close() | sock_close() | Release socket | Teardown connection, free resources |
```c
/**
 * __sys_socket - Create a new socket
 * @family: Protocol family (AF_INET, AF_UNIX, etc.)
 * @type: Socket type (SOCK_STREAM, SOCK_DGRAM, etc.)
 * @protocol: Protocol number (usually 0 for default)
 *
 * This is the kernel implementation of the socket() system call.
 * It creates a new socket structure and associates it with a file descriptor.
 */
int __sys_socket(int family, int type, int protocol)
{
    struct socket *sock;
    int flags, retval;

    /* Extract type flags (SOCK_NONBLOCK, SOCK_CLOEXEC) */
    flags = type & ~SOCK_TYPE_MASK;
    type &= SOCK_TYPE_MASK;

    /* Validate socket type */
    if (type < 0 || type >= SOCK_MAX)
        return -EINVAL;

    /* Create the socket structure */
    retval = sock_create(family, type, protocol, &sock);
    if (retval < 0)
        return retval;

    /*
     * Map the socket to a file descriptor.
     * This creates struct file and allocates an fd.
     */
    return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}

/**
 * sock_create - Allocate and initialize a socket
 *
 * This function finds the appropriate protocol family handler
 * and calls its create() method to initialize the socket.
 */
int sock_create(int family, int type, int protocol, struct socket **res)
{
    struct socket *sock;
    const struct net_proto_family *pf;
    int err;

    /* Allocate socket structure */
    sock = sock_alloc();
    if (!sock)
        return -ENFILE;

    sock->type = type;

    /* Find and call protocol family handler */
    pf = rcu_dereference(net_families[family]);
    if (!pf || !pf->create) {
        err = -EAFNOSUPPORT;
        goto out_release;
    }

    /*
     * Call protocol-specific create function.
     * For AF_INET/SOCK_STREAM, this eventually calls tcp_v4_init_sock().
     */
    err = pf->create(current->nsproxy->net_ns, sock, protocol, 0);
    if (err < 0)
        goto out_release;

    *res = sock;
    return 0;

out_release:
    sock_release(sock);
    return err;
}
```

The flow of a connect() call:
To understand how system calls traverse the socket layer, let's trace a connect() call on a TCP socket:
1. The application calls connect(fd, addr, addrlen)
2. The syscall enters the kernel at __sys_connect()
3. sockfd_lookup_light() finds the struct socket from the fd
4. move_addr_to_kernel() copies the sockaddr from user space
5. sock->ops->connect() is called
6. For TCP, this dispatches inet_stream_connect() → tcp_v4_connect() → tcp_connect()

This layered dispatch pattern is the key to socket layer extensibility—new protocols can be added by registering handlers without modifying core socket code.
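The user-space side of that trace is a single call. A minimal sketch (hypothetical example; the address is a placeholder):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(80);                    /* network byte order */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr); /* placeholder address */

    /* This one call drives the entire kernel path traced above,
     * ending in tcp_connect() sending the SYN. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("connect");

    close(fd);
    return 0;
}
```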
One of the most elegant aspects of the socket API is its address family abstraction. The same connect(), bind(), and sendto() calls work identically whether you're using IPv4, IPv6, Unix domain sockets, Bluetooth, or any other supported protocol—the difference lies only in the address structure passed.
Address family registration:
Each address family registers a struct net_proto_family with the kernel, providing a create() function that initializes sockets for that family. The global net_families[] array maps family numbers (AF_INET = 2, AF_UNIX = 1, etc.) to their handlers.
```c
/**
 * struct net_proto_family - Protocol family definition
 *
 * Each address family (AF_INET, AF_UNIX, etc.) provides
 * one of these structures to define its socket creation handler.
 */
struct net_proto_family {
    int             family;     /* Address family number */
    int             (*create)(struct net *net, struct socket *sock,
                              int protocol, int kern);
    struct module   *owner;     /* Module providing this family */
};

/* Example: IPv4 family registration */
static const struct net_proto_family inet_family_ops = {
    .family = AF_INET,
    .create = inet_create,  /* Called for socket(AF_INET, ...) */
    .owner  = THIS_MODULE,
};

/* Example: Unix domain socket family */
static const struct net_proto_family unix_family_ops = {
    .family = AF_UNIX,
    .create = unix_create,
    .owner  = THIS_MODULE,
};

/* Registration during module init */
static int __init inet_init(void)
{
    /* ... initialization ... */
    (void)sock_register(&inet_family_ops);
    /* ... more initialization ... */
}

/**
 * inet_create - Create an INET family socket
 *
 * This function handles socket(AF_INET, type, protocol) calls.
 * It determines the correct protocol handler (TCP, UDP, RAW, etc.)
 * and initializes the socket accordingly.
 */
static int inet_create(struct net *net, struct socket *sock, int protocol,
                       int kern)
{
    struct inet_protosw *answer;
    struct proto *answer_prot;
    struct sock *sk;
    int err;

    /* Find protocol switch entry for this type/protocol combination */
    list_for_each_entry_rcu(answer, &inetsw[sock->type], list) {
        if (protocol == answer->protocol || protocol == IPPROTO_IP) {
            /* Found matching protocol handler */
            break;
        }
    }

    /* Setup socket operations table */
    sock->ops   = answer->ops;   /* proto_ops for this type */
    answer_prot = answer->prot;  /* struct proto for this protocol */

    /* Allocate struct sock (protocol-specific) */
    sk = sk_alloc(net, PF_INET, GFP_KERNEL, answer_prot, kern);
    if (!sk)
        return -ENOBUFS;

    /* Initialize the socket */
    sock_init_data(sock, sk);

    /* Call protocol-specific init (tcp_v4_init_sock, udp_init_sock, etc.) */
    if (sk->sk_prot->init) {
        err = sk->sk_prot->init(sk);
        if (err)
            goto out_free;
    }

    return 0;

out_free:
    sk_free(sk);
    return err;
}
```

The sockaddr abstraction:
Address families use different address structures, but all share a common header that allows the kernel to determine the family before interpreting the rest:
```c
/* Generic socket address (minimum required) */
struct sockaddr {
    sa_family_t sa_family;      /* Address family */
    char        sa_data[14];    /* Protocol-specific address */
};

/* IPv4 address */
struct sockaddr_in {
    sa_family_t     sin_family; /* AF_INET */
    __be16          sin_port;   /* Port number */
    struct in_addr  sin_addr;   /* IPv4 address */
    unsigned char   __pad[8];   /* Padding to sockaddr size */
};

/* IPv6 address */
struct sockaddr_in6 {
    sa_family_t     sin6_family;   /* AF_INET6 */
    __be16          sin6_port;     /* Port number */
    __be32          sin6_flowinfo; /* IPv6 flow info */
    struct in6_addr sin6_addr;     /* IPv6 address */
    __u32           sin6_scope_id; /* Scope ID */
};

/* Unix domain socket address */
struct sockaddr_un {
    sa_family_t sun_family;     /* AF_UNIX */
    char        sun_path[108];  /* Pathname */
};
```
The kernel checks sa_family and then casts to the appropriate structure for that family. This simple pattern enables the single API to support dramatically different addressing schemes.
Applications that need to handle multiple address families without knowing the type at compile time use struct sockaddr_storage—a structure large enough to hold any address type with proper alignment. This is essential for protocol-agnostic code like accept() handlers that must store client addresses of unknown family.
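A minimal sketch of that pattern (hypothetical helper function):

```c
#include <stdio.h>
#include <sys/socket.h>

/* Protocol-agnostic accept(): works for IPv4, IPv6, and Unix sockets. */
int accept_any(int listen_fd)
{
    struct sockaddr_storage peer;   /* large enough for any family */
    socklen_t peer_len = sizeof(peer);

    int fd = accept(listen_fd, (struct sockaddr *)&peer, &peer_len);
    if (fd < 0)
        return -1;

    /* Only now do we inspect the family and cast appropriately. */
    switch (peer.ss_family) {
    case AF_INET:  printf("IPv4 client\n"); break;
    case AF_INET6: printf("IPv6 client\n"); break;
    case AF_UNIX:  printf("local client\n"); break;
    }
    return fd;
}
```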
The socket layer achieves protocol independence through operation tables—structures filled with function pointers that implement protocol-specific behavior. This is the C equivalent of virtual method tables in object-oriented languages.
Two levels of operations:
struct proto_ops: User-facing operations attached to struct socket. These correspond directly to socket system calls and perform validation, state checks, and user/kernel data copying before delegating to protocol-specific code.
struct proto: Network layer operations attached to struct sock. These implement the core protocol logic—connection establishment, data transmission, congestion control, etc.
```c
/**
 * struct proto_ops - Socket operation table
 *
 * These operations are attached to struct socket and implement
 * the BSD socket API. They handle user-kernel interface concerns.
 */
struct proto_ops {
    int             family;
    struct module   *owner;

    /* Socket lifecycle */
    int     (*release)(struct socket *sock);
    int     (*bind)(struct socket *sock, struct sockaddr *addr,
                    int addrlen);

    /* Connection management */
    int     (*connect)(struct socket *sock, struct sockaddr *addr,
                       int addrlen, int flags);
    int     (*accept)(struct socket *sock, struct socket *newsock,
                      int flags);
    int     (*listen)(struct socket *sock, int backlog);

    /* Name/Address operations */
    int     (*getname)(struct socket *sock, struct sockaddr *addr,
                       int peer);

    /* Data transfer */
    int     (*sendmsg)(struct socket *sock, struct msghdr *msg,
                       size_t len);
    int     (*recvmsg)(struct socket *sock, struct msghdr *msg,
                       size_t len, int flags);

    /* Options */
    int     (*setsockopt)(struct socket *sock, int level, int optname,
                          sockptr_t optval, unsigned int optlen);
    int     (*getsockopt)(struct socket *sock, int level, int optname,
                          char __user *optval, int __user *optlen);

    /* Event notification */
    __poll_t (*poll)(struct file *file, struct socket *sock,
                     poll_table *wait);

    /* ... additional operations ... */
};

/* TCP socket operations (for SOCK_STREAM over AF_INET) */
const struct proto_ops inet_stream_ops = {
    .family     = PF_INET,
    .owner      = THIS_MODULE,
    .release    = inet_release,
    .bind       = inet_bind,
    .connect    = inet_stream_connect,
    .accept     = inet_accept,
    .listen     = inet_listen,
    .getname    = inet_getname,
    .sendmsg    = inet_sendmsg,
    .recvmsg    = inet_recvmsg,
    .poll       = tcp_poll,
    .setsockopt = sock_common_setsockopt,
    .getsockopt = sock_common_getsockopt,
    /* ... */
};

/**
 * struct proto - Protocol handler operations
 *
 * These operations implement core protocol logic and are attached
 * to struct sock. They handle the actual networking work.
 */
struct proto {
    char            name[32];
    struct module   *owner;

    /* Socket lifecycle */
    int     (*init)(struct sock *sk);
    void    (*destroy)(struct sock *sk);
    void    (*close)(struct sock *sk, long timeout);

    /* Connection management */
    int     (*connect)(struct sock *sk, struct sockaddr *addr,
                       int addrlen);
    int     (*disconnect)(struct sock *sk, int flags);
    struct sock *(*accept)(struct sock *sk, int flags, int *err);

    /* Data transfer */
    int     (*sendmsg)(struct sock *sk, struct msghdr *msg, size_t len);
    int     (*recvmsg)(struct sock *sk, struct msghdr *msg, size_t len,
                       int noblock, int flags, int *addr_len);

    /* Backlog processing (packets deferred while the user holds the lock) */
    int     (*backlog_rcv)(struct sock *sk, struct sk_buff *skb);

    /* Memory management */
    atomic_t    memory_allocated;   /* Protocol memory usage */
    int         memory_pressure;    /* Memory pressure indicator */

    /* Sysctl tunables */
    int     *sysctl_wmem;   /* Write buffer sysctl */
    int     *sysctl_rmem;   /* Read buffer sysctl */

    /* ... many more operations ... */
};

/* TCP protocol handler */
struct proto tcp_prot = {
    .name       = "TCP",
    .owner      = THIS_MODULE,
    .init       = tcp_v4_init_sock,
    .close      = tcp_close,
    .connect    = tcp_v4_connect,
    .disconnect = tcp_disconnect,
    .accept     = inet_csk_accept,
    .sendmsg    = tcp_sendmsg,
    .recvmsg    = tcp_recvmsg,
    /* ... */
};
```

The dispatch flow:
When an application calls send() on a TCP socket, the execution path is:
1. __sys_sendto() — System call entry
2. sock_sendmsg() — Generic socket layer
3. sock->ops->sendmsg() → inet_sendmsg() — INET layer
4. sk->sk_prot->sendmsg() → tcp_sendmsg() — TCP implementation

This chain of indirection costs some CPU cycles, but it enables the clean separation that makes Linux networking so maintainable and extensible.
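The generic layer's contribution to this chain is deliberately thin. Roughly (a condensed sketch of sock_sendmsg() from net/socket.c, with helper indirection omitted):

```c
/* Generic socket layer: run the LSM hook, then dispatch to the
 * family's ops table. */
int sock_sendmsg(struct socket *sock, struct msghdr *msg)
{
    int err = security_socket_sendmsg(sock, msg, msg_data_left(msg));

    if (err)
        return err;
    return sock->ops->sendmsg(sock, msg, msg_data_left(msg));
}
```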
For performance-critical paths, the kernel sometimes bypasses the generic dispatch. For example, TCP fast paths may skip intermediate layers when conditions are favorable. The kernel also uses static keys (runtime-patchable jump instructions) to eliminate branches for common configurations.
Network data flows through the kernel in discrete chunks called socket buffers (struct sk_buff or "skb"). These structures are the workhorses of Linux networking—every packet received or transmitted is represented by an skb. The socket layer manages multiple queues of these buffers to coordinate between application I/O and network activity.
The key queues in struct sock: sk_receive_queue (incoming packets waiting for recv()), sk_write_queue (data scheduled for transmission), sk_error_queue (ICMP errors, timestamps), and sk_backlog (packets that arrive while the application holds the socket lock).
```c
/**
 * sk_buff_head - Queue of socket buffers
 *
 * This doubly-linked list structure manages packet queues.
 * It includes a spinlock for concurrent access protection.
 */
struct sk_buff_head {
    struct sk_buff  *next;
    struct sk_buff  *prev;
    __u32           qlen;   /* Number of buffers in queue */
    spinlock_t      lock;   /* Queue lock */
};

/* Add packet to end of receive queue */
void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *skb)
{
    unsigned long flags;

    spin_lock_irqsave(&list->lock, flags);
    __skb_queue_tail(list, skb);
    spin_unlock_irqrestore(&list->lock, flags);
}

/* Remove and return first packet from receive queue */
struct sk_buff *skb_dequeue(struct sk_buff_head *list)
{
    struct sk_buff *skb;
    unsigned long flags;

    spin_lock_irqsave(&list->lock, flags);
    skb = __skb_dequeue(list);
    spin_unlock_irqrestore(&list->lock, flags);
    return skb;
}

/**
 * Memory accounting for socket queues
 *
 * The kernel tracks memory used by each socket to enforce
 * buffer limits (SO_RCVBUF, SO_SNDBUF) and system-wide limits.
 */

/* Charge memory to receive buffer (simplified) */
int sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, int size)
{
    /* Check if socket receive buffer has room */
    if (!sk_has_account(sk))
        return 1;
    return size <= sk->sk_rcvbuf - atomic_read(&sk->sk_rmem_alloc) ||
           __sk_mem_schedule(sk, size, SK_MEM_RECV);
}

/* Called when skb is queued to receive queue */
void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
    skb->sk = sk;
    skb->destructor = sock_rfree;
    atomic_add(skb->truesize, &sk->sk_rmem_alloc);
}

/*
 * Called when the skb is freed (data consumed by the application).
 * The send-side counterpart, sock_wfree(), additionally calls
 * sk->sk_write_space() to wake writers blocked on a full send buffer.
 */
void sock_rfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;

    atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
}
```

The backlog queue and locking:
Socket processing faces a threading challenge: packets can arrive (in softirq context) while the application is actively reading (in process context). The sk_backlog queue solves this elegantly:
1. The application calls recv() and acquires the socket lock (sk->sk_lock.slock)
2. Packets arriving in softirq context while the lock is held are appended to sk_backlog instead of being processed immediately
3. When the application releases the lock, the queued backlog packets are processed in process context

The backlog mechanism is crucial for performance—without it, packet processing would block on application activity, or vice versa.
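A simplified sketch of the drain step, consistent with the simplified struct sock shown earlier (the real logic lives in __release_sock() in net/core/sock.c and is considerably more involved):

```c
/* Drain the backlog when the application drops the socket lock. */
void release_sock_sketch(struct sock *sk)
{
    struct sk_buff *skb;

    spin_lock_bh(&sk->sk_lock.slock);

    /* Process packets that arrived while the application held the lock. */
    while ((skb = __skb_dequeue(&sk->sk_backlog)) != NULL)
        sk->sk_backlog_rcv(sk, skb);    /* e.g. tcp_v4_do_rcv() for TCP */

    sk->sk_lock.owned = 0;
    spin_unlock_bh(&sk->sk_lock.slock);
}
```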
When socket buffers fill up (sk_rmem_alloc exceeds sk_rcvbuf), the kernel drops incoming packets. For TCP, this triggers flow control—the receive window shrinks, slowing the sender. For UDP, packets are silently dropped. This is why proper buffer sizing (via setsockopt or sysctls) is critical for high-throughput applications.
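A quick way to see buffer sizing in action from user space (a minimal sketch; unprivileged requests are clamped to net.core.rmem_max):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int requested = 1 << 20;            /* ask for 1 MiB */
    int effective = 0;
    socklen_t len = sizeof(effective);

    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

    /* Linux reports roughly double the accepted value, reflecting the
     * internal overhead accounting shown later on this page. */
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len);
    printf("requested %d, effective %d\n", requested, effective);

    close(fd);
    return 0;
}
```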
Socket behavior is extensively configurable through the setsockopt() and getsockopt() system calls. Options are organized by level—the layer of the networking stack that implements them:
| Option | Level | Purpose | Impact |
|---|---|---|---|
| SO_REUSEADDR | SOL_SOCKET | Allow address reuse | Enables server restart without TIME_WAIT delay |
| SO_REUSEPORT | SOL_SOCKET | Allow port sharing | Multiple sockets can bind same port (load balancing) |
| SO_RCVBUF | SOL_SOCKET | Receive buffer size | Controls maximum pending data; affects throughput |
| SO_SNDBUF | SOL_SOCKET | Send buffer size | Controls outgoing queue depth; affects throughput |
| SO_KEEPALIVE | SOL_SOCKET | Enable keepalives | Detect dead peers; important for idle connections |
| SO_LINGER | SOL_SOCKET | Linger on close | Controls close() behavior with pending data |
| TCP_NODELAY | IPPROTO_TCP | Disable Nagle | Reduces latency for small writes (interactive apps) |
| TCP_CORK | IPPROTO_TCP | Cork output | Batches small writes; opposite of NODELAY |
| TCP_QUICKACK | IPPROTO_TCP | Disable delayed ACK | Send ACKs immediately; reduces latency |
| IP_TOS | IPPROTO_IP | Type of Service | Sets DSCP/ECN bits for QoS |
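As a user-space illustration of the level concept, here is a minimal sketch (hypothetical helper) that disables Nagle's algorithm on an existing TCP socket:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle for a latency-sensitive connection. */
int enable_low_latency(int fd)
{
    int one = 1;

    /* Note the level: TCP_NODELAY lives at IPPROTO_TCP, not SOL_SOCKET. */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```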
```c
/**
 * sock_common_setsockopt - Generic setsockopt implementation
 *
 * This function dispatches to the appropriate handler based
 * on the option level (SOL_SOCKET vs protocol-specific).
 */
int sock_common_setsockopt(struct socket *sock, int level, int optname,
                           sockptr_t optval, unsigned int optlen)
{
    struct sock *sk = sock->sk;

    /* SOL_SOCKET options are handled generically */
    if (level == SOL_SOCKET)
        return sock_setsockopt(sock, level, optname, optval, optlen);

    /* Delegate to protocol-specific handler */
    return sk->sk_prot->setsockopt(sk, level, optname, optval, optlen);
}

/**
 * sock_setsockopt - Handle SOL_SOCKET level options
 */
int sock_setsockopt(struct socket *sock, int level, int optname,
                    sockptr_t optval, unsigned int optlen)
{
    struct sock *sk = sock->sk;
    int val;
    int ret = 0;

    if (copy_from_sockptr(&val, optval, sizeof(val)))
        return -EFAULT;

    lock_sock(sk);

    switch (optname) {
    case SO_REUSEADDR:
        sk->sk_reuse = (val ? SK_CAN_REUSE : SK_NO_REUSE);
        break;

    case SO_REUSEPORT:
        sk->sk_reuseport = val ? 1 : 0;
        break;

    case SO_RCVBUF:
        /* Clamp to system limits */
        val = min_t(u32, val, sysctl_rmem_max);
        val = min_t(int, val, INT_MAX / 2);
        sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
        /* Kernel doubles the value (internal overhead accounting) */
        WRITE_ONCE(sk->sk_rcvbuf,
                   max_t(int, val * 2, SOCK_MIN_RCVBUF));
        break;

    case SO_SNDBUF:
        val = min_t(u32, val, sysctl_wmem_max);
        val = min_t(int, val, INT_MAX / 2);
        sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
        WRITE_ONCE(sk->sk_sndbuf,
                   max_t(int, val * 2, SOCK_MIN_SNDBUF));
        break;

    case SO_KEEPALIVE:
        if (sk->sk_prot->keepalive)
            sk->sk_prot->keepalive(sk, val);
        sock_valbool_flag(sk, SOCK_KEEPOPEN, val);
        break;

    case SO_LINGER:
        /* Handle struct linger instead of int */
        /* ... special handling ... */
        break;

    /* ... many more options ... */

    default:
        ret = -ENOPROTOOPT;
        break;
    }

    release_sock(sk);
    return ret;
}
```

For high-bandwidth applications, socket buffers should be sized to handle the bandwidth-delay product (BDP = bandwidth × RTT). For a 10 Gbps link with 50 ms RTT, the BDP is ~62.5 MB. The kernel's autotuning (net.ipv4.tcp_moderate_rcvbuf) handles this for most cases, but manual tuning may be needed for extreme workloads.
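A sketch of BDP-based sizing from user space (hypothetical helper; the request is still subject to net.core.rmem_max clamping):

```c
#include <sys/socket.h>

/* Size the receive buffer to the bandwidth-delay product. */
int size_rcvbuf_for_bdp(int fd, long long bits_per_sec, double rtt_sec)
{
    /* e.g. 10e9 b/s * 0.050 s / 8 = 62,500,000 bytes (~62.5 MB) */
    int bdp_bytes = (int)(bits_per_sec * rtt_sec / 8.0);

    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                      &bdp_bytes, sizeof(bdp_bytes));
}
```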
The Linux socket layer is a masterfully designed abstraction that has scaled from simple client-server applications to handling millions of concurrent connections in modern cloud infrastructure. Understanding its architecture is essential for anyone building high-performance networked systems.
What's next:
With the socket layer understood, we'll dive deeper into the Linux networking stack. The next page explores the protocol stack architecture—how layers from socket to device driver are organized, how packets flow between layers, and the critical structures and functions that implement TCP/IP networking in Linux.
You now understand the Linux socket layer architecture—the bridge between application network I/O and the kernel's protocol implementations. This foundation is essential for understanding performance characteristics, debugging networking issues, and building systems that can scale to handle massive workloads.