A container image is to a container what a class is to an object in object-oriented programming: it's the template from which running instances are created. But unlike simple templates, container images are sophisticated artifacts that carry an entire filesystem, metadata, configuration, and the complete execution environment for an application.
Container images are the most important artifact in modern software delivery. They represent the single source of truth for what gets deployed—the immutable, versioned package that travels unchanged from a developer's laptop through CI/CD pipelines to production clusters across the globe.
Understanding images deeply—their structure, layering mechanism, content addressing, and lifecycle—is essential for anyone architecting containerized systems.
By the end of this page, you will understand the internal structure of container images, how content-addressable storage enables efficient distribution, image layering mechanics, tagging and versioning strategies, base image selection criteria, and best practices for creating secure, minimal, and reproducible images.
A container image is not a single file—it's a collection of filesystem layers plus metadata. Understanding this structure explains why images are efficient to store, transfer, and modify.
Image Components:
| Component | Description | Example |
|---|---|---|
| Filesystem Layers | Tar archives containing files/directories added at each build step | Base OS files, Python packages, application code |
| Image Configuration | JSON metadata describing how to run the container | Environment variables, entrypoint, exposed ports |
| Image Manifest | JSON listing all layers with their digests | Links layers to form the complete image |
| Layer Digests | SHA256 hashes uniquely identifying each layer | sha256:a3ed95cae... |
```
Container Image: myapp:1.0
├── manifest.json            # Lists all layers and config
│   {
│     "config": { "digest": "sha256:..." },
│     "layers": [
│       { "digest": "sha256:abc...", "size": 28558848 },
│       { "digest": "sha256:def...", "size": 16458752 },
│       { "digest": "sha256:ghi...", "size": 4194304 }
│     ]
│   }
│
├── config.json              # Runtime configuration
│   {
│     "architecture": "amd64",
│     "os": "linux",
│     "config": {
│       "Env": ["PATH=/usr/local/bin:/usr/bin"],
│       "Cmd": ["python", "app.py"],
│       "WorkingDir": "/app",
│       "ExposedPorts": { "8080/tcp": {} }
│     },
│     "history": [...]       # Build history
│   }
│
└── layers/                  # Filesystem layers (tar.gz)
    ├── sha256:abc.../layer.tar.gz  # Base OS (~28 MB)
    ├── sha256:def.../layer.tar.gz  # Python packages (~16 MB)
    └── sha256:ghi.../layer.tar.gz  # Application code (~4 MB)
```

Content-Addressable Storage:
Every layer and configuration in a container image is identified by a cryptographic hash (SHA256 digest) of its contents. This is content-addressable storage, and it has profound implications:
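The core property can be shown in a few lines of Python: when the identifier of a blob is simply the hash of its bytes, identical content always gets the identical name, so a store keyed by digest deduplicates for free. This is an illustrative sketch, not registry code — the layer contents are made up.

```python
import hashlib

def digest(content: bytes) -> str:
    """Content-addressable ID: the name *is* the SHA256 hash of the bytes."""
    return "sha256:" + hashlib.sha256(content).hexdigest()

layer_a = b"base OS files"
layer_b = b"application code"
layer_a_again = b"base OS files"  # identical content, e.g. a shared base layer

# Identical content always yields the identical digest...
assert digest(layer_a) == digest(layer_a_again)

# ...so a store keyed by digest deduplicates automatically, and any
# corruption or tampering changes the digest and is immediately detectable.
store = {digest(layer): layer for layer in [layer_a, layer_b, layer_a_again]}
print(len(store))  # 2 — the duplicate layer is stored only once
```

This is the same mechanism that lets registries skip uploading layers they already have: the client sends the digest first, and the registry answers "already got it."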
Tags like 'nginx:1.25' are mutable pointers—they can be moved to point to different image digests. Digests like 'nginx@sha256:abc123...' are immutable references. For production deployments where reproducibility is critical, always reference images by digest, not just tag.
Container image layering is more nuanced than simple stacking. Understanding the mechanics helps you optimize images and debug unexpected behaviors.
The Union Filesystem:
Container runtimes use a union filesystem (OverlayFS, AUFS, or similar) to merge multiple layers into a single coherent view. This is copy-on-write (CoW) at the filesystem level:
```
Union Filesystem (OverlayFS on Linux)
=====================================

┌─────────────────────────────────────────────────────────────┐
│ Merged View (what container sees)                           │
│ /app/main.py  /app/config.py  /etc/passwd  /bin/python      │
└─────────────────────────────────────────────────────────────┘
                         ▲
                         │ Merge
┌─────────────────────────────────────────────────────────────┐
│ Upper Layer (container writable layer)                      │
│ /app/config.py (modified)                                   │
│ /app/.logs/ (new directory)                                 │
│ /etc/hostname (whiteout - marks deletion)                   │
└─────────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────────┐
│ Lower Layer 3: Application code                             │
│ /app/main.py  /app/config.py (original)                     │
└─────────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────────┐
│ Lower Layer 2: Python runtime                               │
│ /usr/local/bin/python  /usr/local/lib/python3.11/           │
└─────────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────────┐
│ Base Layer: Alpine Linux                                    │
│ /bin/sh  /etc/passwd  /lib/apk/                             │
└─────────────────────────────────────────────────────────────┘

File Resolution:
Read /app/config.py → Found in Upper Layer → Return modified version
Read /app/main.py   → Not in Upper → Check Layer 3 → Found → Return
Read /bin/python    → Not in Upper/3 → Check Layer 2 → Found → Return
```

Whiteout Files and Deletions:
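The top-down lookup that a union filesystem performs can be sketched in Python. This is a toy model, not OverlayFS itself: each layer is a dict mapping path to content, and `None` stands in for a whiteout marker that hides a file present in a lower layer.

```python
# Layers ordered top (writable) to bottom (base), mirroring the diagram above.
layers = [
    {"/app/config.py": "modified", "/etc/hostname": None},       # upper (writable)
    {"/app/main.py": "original", "/app/config.py": "original"},  # application code
    {"/bin/python": "interpreter"},                              # Python runtime
    {"/etc/passwd": "root:x:0:0", "/etc/hostname": "base"},      # base layer
]

def resolve(path):
    """Walk layers top-down; the first layer containing the path wins."""
    for layer in layers:
        if path in layer:
            return layer[path]  # None means a whiteout hid the file
    raise FileNotFoundError(path)

print(resolve("/app/config.py"))  # 'modified' — upper layer shadows lower
print(resolve("/app/main.py"))    # 'original' — found in a lower layer
print(resolve("/etc/hostname"))   # None — whiteout hides the base-layer file
```

Note that `resolve("/etc/hostname")` stops at the whiteout in the upper layer — the original file in the base layer is still physically there, it's just never reached. That is exactly why deletions don't shrink images, as the next section explains.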
A critical concept often misunderstood: deleting a file doesn't reduce image size. When you delete a file in a later layer, the union filesystem creates a "whiteout" marker that hides the file, but the original still exists in the earlier layer.
This has important implications:
```dockerfile
# BAD: File still exists in Layer 1, total size = 200 MB
FROM alpine
RUN wget http://example.com/large-100mb-file.tar.gz  # Layer 1: +100 MB
RUN tar -xzf large-100mb-file.tar.gz                 # Layer 2: +100 MB (extracted)
RUN rm large-100mb-file.tar.gz                       # Layer 3: +4 KB (whiteout only)
# Total: 200 MB (original archive still in Layer 1!)

# GOOD: Download, extract, and clean up in a single layer
FROM alpine
RUN wget http://example.com/large-100mb-file.tar.gz && \
    tar -xzf large-100mb-file.tar.gz && \
    rm large-100mb-file.tar.gz
# Total: ~100 MB (only extracted files remain)
```

Never use a separate RUN command to remove files created in a previous layer. Always combine download/install/cleanup in a single RUN command. This is the most common cause of unexpectedly large images. Use 'docker history' to see the size contribution of each layer.
Layer sharing across images:
When multiple images share base layers, Docker stores them only once:
```
myapp-api:latest ─┬─ [Application layer: 10 MB]
                  │
                  └─ [Node.js runtime: 200 MB] ← Shared
                  │
myapp-web:latest ─┬─ [Application layer: 15 MB]
                  │
                  └─ [Node.js runtime: 200 MB] ← Same layer!

Total disk usage: 200 MB + 10 MB + 15 MB = 225 MB
(Not 200 + 10 + 200 + 15 = 425 MB)
```
This is why standardizing on base images across your organization dramatically reduces storage and network costs.
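Because layers are keyed by digest, the deduplication falls out of simple set semantics. A short Python sketch (with made-up digests and sizes matching the figures above) makes the arithmetic concrete:

```python
# Each image is a list of (layer_digest, size_in_MB); shared digests
# are hypothetical placeholders, not real SHA256 values.
images = {
    "myapp-api:latest": [("sha256:node-runtime", 200), ("sha256:api-code", 10)],
    "myapp-web:latest": [("sha256:node-runtime", 200), ("sha256:web-code", 15)],
}

# Naive total: count every layer of every image.
naive = sum(size for layers in images.values() for _, size in layers)

# Actual total: dict keys collapse layers with the same digest.
unique_layers = {dig: size for layers in images.values() for dig, size in layers}
actual = sum(unique_layers.values())

print(naive, actual)  # 425 225 — the shared runtime layer is stored once
```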
Image tags are critical for deployment automation, rollback procedures, and system auditability. A well-designed tagging strategy prevents deployment confusion and enables reliable operations.
Common Tagging Patterns:
| Pattern | Example | Use Case | Pros/Cons |
|---|---|---|---|
| Semantic Version | myapp:1.2.3 | Releases | Clear versioning; can be mutable |
| Git SHA | myapp:a3f8d2c | CI/CD pipelines | Immutable; traceable to commit; hard to read |
| Git SHA + Version | myapp:1.2.3-a3f8d2c | Production | Best of both; clearly versioned and traceable |
| Date/Timestamp | myapp:2024-01-15 | Nightly builds | Time-ordered; hard to correlate to code |
| Branch | myapp:main | Development | Always latest; unsuitable for production |
| Environment | myapp:prod | Deployment shortcuts | Dangerous; tags get overwritten |
| latest | myapp:latest | Quick development | Ambiguous; never use in production |
Recommended Production Strategy:
The most robust approach combines semantic versioning with git commit SHA, plus maintaining 'rolling' tags for convenience:
```bash
#!/bin/bash
# In CI/CD pipeline after successful build and tests

# Get version from package.json, Cargo.toml, etc.
VERSION=$(npm pkg get version | tr -d '"')   # e.g., "2.3.1"
GIT_SHA=$(git rev-parse --short HEAD)        # e.g., "a3f8d2c"
BRANCH=$(git branch --show-current)          # e.g., "main"

# Full image reference
IMAGE="myregistry.com/myapp"

# Tag with multiple tags for flexibility
docker tag myapp:build "$IMAGE:$VERSION"           # myapp:2.3.1
docker tag myapp:build "$IMAGE:$VERSION-$GIT_SHA"  # myapp:2.3.1-a3f8d2c
docker tag myapp:build "$IMAGE:$GIT_SHA"           # myapp:a3f8d2c

# Optionally, for the main branch, update 'latest' and major/minor version tags
if [ "$BRANCH" = "main" ]; then
  docker tag myapp:build "$IMAGE:latest"
  docker tag myapp:build "$IMAGE:2"    # Major version
  docker tag myapp:build "$IMAGE:2.3"  # Minor version
fi

# Push all tags
docker push --all-tags "$IMAGE"
```

Tags are mutable by default—pushing 'myapp:1.0' twice overwrites the first image. This can cause production to run different code than expected if tags are reused. Either enforce immutable tags in your registry or always deploy with digests.
Your base image choice affects image size, security posture, compatibility, and operational characteristics. This decision ripples through your entire container infrastructure.
Common Base Image Options:
| Base Image | Size | Package Manager | Best For |
|---|---|---|---|
| scratch | 0 MB | None | Static binaries (Go, Rust) |
| alpine:3.18 | ~5 MB | apk | Minimal containers, most cases |
| distroless | ~20 MB | None | Security-focused deployments |
| debian:bookworm-slim | ~75 MB | apt | Compatibility with glibc apps |
| ubuntu:22.04 | ~78 MB | apt | Familiarity, broad package support |
| python:3.11-slim | ~120 MB | apt + pip | Python applications |
| node:20-slim | ~180 MB | apt + npm | Node.js applications |
Deep Dive: Alpine Linux
Alpine is the most popular minimal base image, but it has important trade-offs: it uses musl libc instead of glibc, which can break prebuilt binaries and Python wheels compiled against glibc, and its DNS resolution and package behavior sometimes differ subtly from Debian-based images. Test your application thoroughly on Alpine before standardizing on it.
Deep Dive: Distroless Images
Google's distroless images contain only your application and its runtime dependencies—no shell, no package manager, no unnecessary utilities. This is the gold standard for production security:
```dockerfile
# Multi-stage build with distroless runtime
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o /app/main

# Distroless has NO shell, NO package manager
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder /app/main /
USER nonroot:nonroot
ENTRYPOINT ["/main"]

# For debugging (has shell but defeats some security benefits)
# FROM gcr.io/distroless/static-debian12:debug
```

Start with the smallest image that works for your application. For Go/Rust: use 'scratch' or 'distroless'. For Python/Node: use slim variants and test carefully with Alpine. If you encounter compatibility issues with Alpine, debian-slim is the reliable fallback. Document why you chose a particular base image.
Container image security is a critical concern—images often contain vulnerabilities, exposed secrets, or unnecessary attack surface. Building secure images is a skill that separates amateur containerization from professional practice.
The Security Mindset:
```dockerfile
# SECURITY-FOCUSED DOCKERFILE

# Pin exact base image version
FROM python:3.11.7-slim-bookworm@sha256:abc123...

# Set labels for image provenance
LABEL org.opencontainers.image.source="https://github.com/org/repo" \
      org.opencontainers.image.version="1.2.3" \
      org.opencontainers.image.vendor="MyCompany"

# Don't run as root
ARG UID=1000
ARG GID=1000
RUN groupadd --gid $GID appgroup && \
    useradd --uid $UID --gid $GID --shell /sbin/nologin appuser

WORKDIR /app

# Pin dependency versions in requirements.txt (or use poetry.lock)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && \
    # Remove pip cache and unnecessary files
    rm -rf /root/.cache /var/cache/apt /var/lib/apt/lists/*

# Copy only necessary files
COPY --chown=appuser:appgroup src/ ./src/

# Switch to non-root user
USER appuser

# Minimal environment
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# No secrets here! Inject at runtime
# ENV DATABASE_URL=  # WRONG!

EXPOSE 8080

# Health check without external dependencies
HEALTHCHECK --interval=30s --timeout=5s \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8080/health')"

CMD ["python", "-m", "gunicorn", "-b", "0.0.0.0:8080", "app:app"]
```

Vulnerability Scanning:
Integrate image scanning into your CI/CD pipeline. Popular tools include:
| Tool | Type | Integration |
|---|---|---|
| Trivy | Open source, comprehensive | CLI, GitHub Actions, GitLab CI |
| Grype | Open source by Anchore | CLI, GitHub Actions |
| Snyk | Commercial with free tier | CLI, IDE, CI/CD, registry |
| Clair | Open source by CoreOS | API-based, registry integration |
| AWS ECR Scanning | Built into ECR | Automatic on push |
| Docker Scout | Docker's scanning solution | Docker Desktop, Docker Hub |
A scan that finds vulnerabilities but triggers no action is security theater. Establish policies: HIGH vulnerabilities block deployment, MEDIUM must be fixed within 30 days, LOW must be tracked. Rebuild images when base images are patched.
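A policy like this is easy to automate as a CI gate. The sketch below is a hypothetical example (the severity names and `findings` structure are assumptions, loosely modeled on typical scanner JSON output, not any specific tool's schema):

```python
# Policy mirroring the text above: HIGH blocks, MEDIUM gets a deadline, LOW is tracked.
POLICY = {"HIGH": "block", "MEDIUM": "fix within 30 days", "LOW": "track"}

def gate(findings):
    """Return (deploy_allowed, per-finding actions) for scan results."""
    allowed = True
    actions = []
    for finding in findings:
        action = POLICY.get(finding["severity"], "ignore")
        if action == "block":
            allowed = False
        actions.append((finding["id"], action))
    return allowed, actions

# Example scan output (fabricated CVE IDs for illustration)
findings = [
    {"id": "CVE-2024-0001", "severity": "HIGH"},
    {"id": "CVE-2024-0002", "severity": "LOW"},
]
allowed, actions = gate(findings)
print(allowed)  # False — the HIGH finding blocks deployment
```

In a real pipeline the scanner's JSON report would be parsed into `findings`, and a `False` result would fail the CI job before the image is ever pushed.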
Reproducible builds ensure that building the same source code produces bit-for-bit identical images. This is harder than it sounds and crucial for security, debugging, and compliance.
Why Reproducibility Matters:
Achieving Reproducibility:
```dockerfile
# Pin base image by DIGEST (not just tag)
FROM python:3.11.7-slim-bookworm@sha256:a1b2c3d4e5f6...

# Pin package versions explicitly
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        libpq5=15.4-0+deb12u1 \
        ca-certificates=20230311 && \
    rm -rf /var/lib/apt/lists/*

# Use lock files for application dependencies
COPY requirements.lock .
RUN pip install --no-cache-dir -r requirements.lock

# If using npm, copy package-lock.json
# COPY package.json package-lock.json ./
# RUN npm ci  # Uses locked versions

# Set SOURCE_DATE_EPOCH for reproducible timestamps
ARG SOURCE_DATE_EPOCH
ENV SOURCE_DATE_EPOCH=${SOURCE_DATE_EPOCH:-0}

COPY . .
```

Software Bill of Materials (SBOM):
An SBOM documents all components in your image—packages, libraries, and their versions. SBOMs are increasingly required for compliance and enable automated vulnerability tracking.
```bash
# Generate SBOM with Syft
syft myapp:latest -o spdx-json > sbom.spdx.json

# Generate SBOM with Trivy
trivy image --format spdx-json myapp:latest > sbom.spdx.json

# Attach SBOM to image (using cosign)
cosign attach sbom --sbom sbom.spdx.json myregistry.com/myapp:1.0

# Scan SBOM for vulnerabilities
grype sbom:sbom.spdx.json
```

For high-security environments, sign your images cryptographically using tools like Cosign (part of Sigstore). Image signatures prove that the image came from a trusted source and hasn't been tampered with. Kubernetes can be configured to only run signed images.
Smaller images mean faster pulls, faster deployments, lower storage costs, and better security. Here are advanced techniques for minimizing image size beyond basic multi-stage builds.
Size Optimization Strategies:
```dockerfile
# Build stage
FROM golang:1.21-alpine AS builder

RUN apk add --no-cache upx

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .

# Build with optimizations
RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
    go build -ldflags="-s -w" -o main . && \
    upx --best main

# Runtime stage - FROM SCRATCH (nothing!)
FROM scratch

# Copy CA certificates for HTTPS (if needed)
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/

# Copy the compressed binary
COPY --from=builder /app/main /

ENTRYPOINT ["/main"]

# Final size: often < 10 MB for a full web application
```

Analyzing Image Size:
Use tools to understand what's contributing to image size:
```bash
# View layer sizes
docker history myapp:latest

# Detailed layer analysis
docker history --no-trunc --format "{{.Size}}\t{{.CreatedBy}}" myapp:latest

# Interactive layer explorer (highly recommended)
dive myapp:latest

# Show total image size
docker images myapp:latest --format "{{.Size}}"
```

The 'dive' tool (github.com/wagoodman/dive) provides an interactive terminal UI that lets you explore each layer, see exactly which files were added/modified/deleted, and calculate wasted space. It's invaluable for optimizing Dockerfiles.
Let's consolidate our understanding of container images:
What's next:
With a deep understanding of container images, we'll explore Container Registries—the infrastructure that stores, distributes, and secures your images. You'll learn about public and private registries, authentication, access control, and registry operations at scale.
You now have comprehensive knowledge of container images—from their internal structure and layering mechanics to security practices and optimization techniques. This understanding is essential for building production-grade containerized systems.