Fetching a web page is only half the battle. The raw HTML returned by a server is a complex mix of content, navigation, advertisements, scripts, and boilerplate. Extracting the meaningful content and discovering links for further crawling is where real value is created.
Content extraction transforms raw bytes into structured, indexable information. Done well, it enables search engines to understand page topics, news aggregators to identify articles, and price comparison sites to extract product details. Done poorly, it produces garbage data that corrupts downstream systems.
By the end of this page, you will understand: (1) HTML parsing strategies and libraries, (2) Link extraction and URL resolution, (3) Main content extraction techniques, (4) Handling JavaScript-rendered content, (5) Metadata extraction (titles, descriptions, structured data), and (6) Dealing with diverse content types.
Real-world HTML is messy. Pages contain syntax errors, unclosed tags, and malformed markup. A robust parser must handle the worst while still extracting value.
Parser types:
| Library | Language | Type | Malformed HTML |
|---|---|---|---|
| cheerio | JavaScript/Node | DOM (jQuery-like) | Good |
| jsdom | JavaScript/Node | Full DOM | Excellent |
| BeautifulSoup | Python | DOM | Excellent |
| lxml | Python | DOM/SAX | Good |
| Jsoup | Java | DOM | Excellent |
| htmlparser2 | JavaScript | SAX/streaming | Good |
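A quick illustration of lenient parsing before the full wrapper below: cheerio (like the other DOM libraries above) repairs unclosed tags into a queryable tree. The markup and expected output in this snippet are our own minimal example:

```typescript
import * as cheerio from 'cheerio';

// Malformed markup: unclosed <p> and <b> tags plus a stray </i>.
const messy = '<p>First paragraph <b>bold text<p>Second paragraph</i>';
const $ = cheerio.load(messy);

// The parser closes dangling tags per the HTML spec, so queries still work.
console.log($('p').length);         // 2
console.log($('p').first().text()); // "First paragraph bold text"
```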
A reusable parsing wrapper built on cheerio:

```typescript
import * as cheerio from 'cheerio';

class HTMLParser {
  private $: cheerio.CheerioAPI;

  constructor(html: string) {
    this.$ = cheerio.load(html, {
      decodeEntities: true,
      lowerCaseTags: true,
      lowerCaseAttributeNames: true,
    });
  }

  // Extract page title, falling back to Open Graph
  getTitle(): string {
    return this.$('title').first().text().trim() ||
      this.$('meta[property="og:title"]').attr('content') || '';
  }

  // Extract meta description, falling back to Open Graph
  getDescription(): string {
    return this.$('meta[name="description"]').attr('content') ||
      this.$('meta[property="og:description"]').attr('content') || '';
  }

  // Extract all links with anchor text and rel attributes
  getLinks(): Array<{ href: string; text: string; rel: string }> {
    const links: Array<{ href: string; text: string; rel: string }> = [];
    this.$('a[href]').each((_, el) => {
      const $el = this.$(el);
      links.push({
        href: $el.attr('href') || '',
        text: $el.text().trim(),
        rel: $el.attr('rel') || ''
      });
    });
    return links;
  }

  // Extract canonical URL, if declared
  getCanonical(): string | null {
    return this.$('link[rel="canonical"]').attr('href') || null;
  }
}
```

Discovering new URLs is a core crawler function. Links appear in many forms and must be normalized before being added to the frontier.
Link sources in HTML:
- `<a href="...">` — Standard hyperlinks
- `<link rel="canonical" href="...">` — Canonical URL
- `<link rel="alternate" href="...">` — Alternate versions
- `<img src="...">`, `<script src="...">` — Resources (usually filtered)
- `<form action="...">` — Form targets (often excluded)
- `data-*` attributes — Framework-specific patterns (see the sketch below)
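Framework-specific `data-*` patterns vary widely by site. As a minimal sketch, a crawler might scan a few conventional attribute names; the names `data-href`, `data-url`, and `data-src` below are common conventions we assume, not a standard:

```typescript
import * as cheerio from 'cheerio';

// Hypothetical sketch: harvest URLs from conventional data-* attributes.
function extractDataAttributeUrls(html: string, baseUrl: string): string[] {
  const $ = cheerio.load(html);
  const urls: string[] = [];
  $('[data-href], [data-url], [data-src]').each((_, el) => {
    const raw = $(el).attr('data-href') ?? $(el).attr('data-url') ?? $(el).attr('data-src');
    if (!raw) return;
    try {
      const resolved = new URL(raw, baseUrl).href; // resolve relative values
      if (resolved.startsWith('http')) urls.push(resolved);
    } catch {
      // Ignore values that are not URLs (JSON blobs, ids, etc.)
    }
  });
  return urls;
}
```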
For standard hyperlinks, the extractor below resolves relative URLs (honoring any `<base>` tag), filters non-HTTP schemes, records link context, and deduplicates:

```typescript
class LinkExtractor {
  private baseUrl: string;
  private $: cheerio.CheerioAPI;

  constructor(html: string, baseUrl: string) {
    this.$ = cheerio.load(html);
    // Check for a <base> tag, which overrides the base URL
    const baseTag = this.$('base[href]').attr('href');
    this.baseUrl = baseTag ? new URL(baseTag, baseUrl).href : baseUrl;
  }

  extractLinks(): Array<{ url: string; context: LinkContext }> {
    const links: Array<{ url: string; context: LinkContext }> = [];

    this.$('a[href]').each((_, el) => {
      const $el = this.$(el);
      const href = $el.attr('href');
      if (!href) return;

      const resolved = this.resolveUrl(href);
      if (!resolved) return;

      // Skip non-crawlable protocols
      if (!resolved.startsWith('http://') && !resolved.startsWith('https://')) {
        return;
      }

      links.push({
        url: resolved,
        context: {
          anchorText: $el.text().trim().slice(0, 200),
          rel: $el.attr('rel') || '',
          isNavigation: this.isNavigationLink($el),
          isNofollow: ($el.attr('rel') || '').includes('nofollow'),
        }
      });
    });

    return this.deduplicateLinks(links);
  }

  private resolveUrl(href: string): string | null {
    try {
      // Handle protocol-relative URLs
      if (href.startsWith('//')) {
        href = 'https:' + href;
      }
      return new URL(href, this.baseUrl).href;
    } catch {
      return null; // Invalid URL
    }
  }

  private isNavigationLink($el: cheerio.Cheerio<any>): boolean {
    // Check whether the link sits inside nav, header, or footer chrome
    return $el.closest('nav, header, footer, .navigation, .menu').length > 0;
  }

  private deduplicateLinks(links: Array<{ url: string; context: LinkContext }>) {
    const seen = new Set<string>();
    return links.filter(link => {
      if (seen.has(link.url)) return false;
      seen.add(link.url);
      return true;
    });
  }
}

interface LinkContext {
  anchorText: string;
  rel: string;
  isNavigation: boolean;
  isNofollow: boolean;
}
```

Links marked with `rel="nofollow"` indicate that the site owner doesn't endorse the target. You may still crawl these URLs, but they should receive lower priority in ranking calculations. Similarly, `sponsored` and `ugc` (user-generated content) links carry different trust signals.
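A minimal sketch of turning `rel` tokens into a coarse trust label (the label names here are illustrative, not a standard):

```typescript
type LinkTrust = 'endorsed' | 'nofollow' | 'sponsored' | 'ugc';

// rel is a space-separated token list, so split before matching.
function classifyRel(rel: string): LinkTrust {
  const tokens = rel.toLowerCase().split(/\s+/);
  if (tokens.includes('sponsored')) return 'sponsored'; // paid placement
  if (tokens.includes('ugc')) return 'ugc';             // user-generated content
  if (tokens.includes('nofollow')) return 'nofollow';   // not endorsed
  return 'endorsed';                                    // default: editorial link
}

classifyRel('ugc nofollow'); // 'ugc' -- the more specific signal wins
```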
Separating main content from boilerplate (headers, footers, sidebars, ads) is essential for both indexing and duplicate detection. Several approaches exist:
Extraction techniques include stripping known boilerplate elements, scoring candidate containers by text density, rewarding paragraph structure, and penalizing link-dense regions. The extractor below combines all four:
```typescript
class ContentExtractor {
  private $: cheerio.CheerioAPI;

  constructor(html: string) {
    this.$ = cheerio.load(html);
    this.removeBoilerplate();
  }

  private removeBoilerplate(): void {
    // Remove non-content elements
    this.$('script, style, noscript, iframe, svg').remove();
    this.$('header, footer, nav, aside').remove();
    this.$('[role="navigation"], [role="banner"], [role="contentinfo"]').remove();
    this.$('.advertisement, .ads, .sidebar, .comments').remove();
  }

  extractMainContent(): { text: string; wordCount: number } {
    // Find content candidates
    const candidates = this.findContentCandidates();

    // Score candidates by text density
    let bestCandidate = { element: this.$('body'), score: 0 };
    for (const candidate of candidates) {
      const score = this.scoreElement(candidate);
      if (score > bestCandidate.score) {
        bestCandidate = { element: candidate, score };
      }
    }

    const text = bestCandidate.element.text()
      .replace(/\s+/g, ' ')
      .trim();

    return {
      text,
      wordCount: text.split(/\s+/).filter(w => w.length > 0).length
    };
  }

  private findContentCandidates(): cheerio.Cheerio<any>[] {
    const candidates: cheerio.Cheerio<any>[] = [];

    // Common content containers
    const selectors = [
      'article', 'main', '[role="main"]',
      '.content', '.post', '.article', '.entry',
      '#content', '#main', '#article'
    ];

    for (const selector of selectors) {
      this.$(selector).each((_, el) => {
        candidates.push(this.$(el));
      });
    }

    // Also consider divs with substantial text
    this.$('div').each((_, el) => {
      const $el = this.$(el);
      if ($el.text().length > 500) {
        candidates.push($el);
      }
    });

    return candidates;
  }

  private scoreElement($el: cheerio.Cheerio<any>): number {
    const text = $el.text();
    const html = $el.html() || '';

    // Text density: ratio of text to HTML
    const textDensity = text.length / Math.max(html.length, 1);

    // Paragraph bonus: articles have paragraphs
    const paragraphCount = $el.find('p').length;

    // Link penalty: navigation-heavy areas have many links
    const linkDensity = $el.find('a').text().length / Math.max(text.length, 1);

    return (textDensity * 100) + (paragraphCount * 10) - (linkDensity * 50);
  }
}
```

The modern web increasingly relies on JavaScript to render content. Single Page Applications (SPAs) may return nearly empty HTML that only populates after JavaScript execution.
Options for JavaScript content:
| Approach | Pros | Cons | Use Case |
|---|---|---|---|
| Skip JS | Fast, simple, resource-efficient | Miss dynamic content | Static sites, blogs |
| Headless browser (Puppeteer/Playwright) | Full rendering, accurate | Slow, resource-intensive | SPAs, complex sites |
| Pre-rendering services | Outsource complexity | Cost, latency | Hybrid approach |
| API discovery | Direct structured data | Site-specific | Known data sources |
A Puppeteer-based renderer that blocks heavyweight resources:

```typescript
import puppeteer, { Browser } from 'puppeteer';

class JSRenderer {
  private browser: Browser | null = null;

  async initialize(): Promise<void> {
    this.browser = await puppeteer.launch({
      headless: 'new',
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
  }

  async renderPage(url: string): Promise<{ html: string; finalUrl: string }> {
    if (!this.browser) throw new Error('Browser not initialized');

    const page = await this.browser.newPage();
    try {
      // Block unnecessary resources to speed up rendering
      await page.setRequestInterception(true);
      page.on('request', req => {
        const type = req.resourceType();
        if (['image', 'media', 'font'].includes(type)) {
          req.abort();
        } else {
          req.continue();
        }
      });

      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 30000
      });

      // Give dynamic content a moment to settle
      await new Promise(resolve => setTimeout(resolve, 2000));

      const html = await page.content();
      const finalUrl = page.url();
      return { html, finalUrl };
    } finally {
      await page.close();
    }
  }

  async close(): Promise<void> {
    await this.browser?.close();
  }
}
```

Don't render everything with a headless browser: it's too slow and expensive. Use heuristics instead. If the initial HTML has minimal content (under 1 KB of text) but the page comes from a known SPA framework, queue it for rendering; most pages don't need it.
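A minimal sketch of that gating heuristic, assuming a 1 KB text threshold and a few common SPA mount-point markers (both are illustrative choices, not fixed rules):

```typescript
import * as cheerio from 'cheerio';

// Decide whether a fetched page is worth re-queuing for headless rendering.
function needsRendering(html: string): boolean {
  const $ = cheerio.load(html);
  $('script, style, noscript').remove();
  const textLength = $('body').text().replace(/\s+/g, ' ').trim().length;

  // Plenty of static text: no rendering needed.
  if (textLength >= 1024) return false;

  // Thin HTML plus a recognizable SPA mount point or framework artifact.
  const spaMarkers = [
    '#root', '#app',          // common React/Vue mount points
    '[data-reactroot]',       // older React
    'script#__NEXT_DATA__',   // Next.js
    '[ng-app], [ng-version]', // Angular
  ];
  return spaMarkers.some(sel => $(sel).length > 0);
}
```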
Beyond main content, pages contain valuable metadata in standardized formats that enhance understanding and enable rich search results.
```typescript
interface PageMetadata {
  title: string;
  description: string;
  canonical: string | null;
  language: string | null;
  author: string | null;
  publishDate: string | null;
  openGraph: Record<string, string>;
  structuredData: object[];
}

class MetadataExtractor {
  private $: cheerio.CheerioAPI;

  constructor(html: string) {
    this.$ = cheerio.load(html);
  }

  extract(): PageMetadata {
    return {
      title: this.getTitle(),
      description: this.getDescription(),
      canonical: this.$('link[rel="canonical"]').attr('href') || null,
      language: this.$('html').attr('lang') || null,
      author: this.$('meta[name="author"]').attr('content') || null,
      publishDate: this.getPublishDate(),
      openGraph: this.getOpenGraph(),
      structuredData: this.getStructuredData(),
    };
  }

  private getTitle(): string {
    return this.$('title').text().trim() ||
      this.$('meta[property="og:title"]').attr('content') || '';
  }

  private getDescription(): string {
    return this.$('meta[name="description"]').attr('content') ||
      this.$('meta[property="og:description"]').attr('content') || '';
  }

  private getPublishDate(): string | null {
    const selectors = [
      'meta[property="article:published_time"]',
      'meta[name="date"]',
      'time[datetime]'
    ];
    for (const sel of selectors) {
      const val = this.$(sel).attr('content') || this.$(sel).attr('datetime');
      if (val) return val;
    }
    return null;
  }

  private getOpenGraph(): Record<string, string> {
    const og: Record<string, string> = {};
    this.$('meta[property^="og:"]').each((_, el) => {
      const prop = this.$(el).attr('property')?.replace('og:', '');
      const content = this.$(el).attr('content');
      if (prop && content) og[prop] = content;
    });
    return og;
  }

  private getStructuredData(): object[] {
    const data: object[] = [];
    this.$('script[type="application/ld+json"]').each((_, el) => {
      try {
        const json = JSON.parse(this.$(el).html() || '');
        data.push(json);
      } catch { /* ignore parse errors */ }
    });
    return data;
  }
}
```

Not everything on the web is HTML. Crawlers encounter PDFs, images, XML feeds, and more. Each content type requires specific handling.
| Content-Type | Extraction Approach | Link Discovery |
|---|---|---|
| text/html | Standard HTML parsing | Extract <a>, <link>, etc. |
| application/pdf | PDF text extraction (pdf-parse) | Extract embedded URLs |
| application/xml, text/xml | XML parsing; RSS/Atom feeds | Extract <link> elements |
| application/json | JSON parsing; API responses | Extract URL fields |
| image/* | OCR for text; EXIF metadata | Limited (embedded URLs rare) |
| application/javascript | Skip or parse for URLs | Regex URL extraction |
A dispatcher that routes each response by MIME type:

```typescript
import pdfParse from 'pdf-parse';

class ContentProcessor {
  async process(url: string, contentType: string, body: Buffer): Promise<ProcessedContent> {
    const mimeType = contentType.split(';')[0].trim().toLowerCase();

    switch (mimeType) {
      case 'text/html':
      case 'application/xhtml+xml':
        return this.processHTML(body.toString('utf-8'), url);

      case 'application/pdf':
        return this.processPDF(body, url);

      case 'application/xml':
      case 'text/xml':
      case 'application/rss+xml':
      case 'application/atom+xml':
        return this.processXML(body.toString('utf-8'), url);

      case 'application/json':
        return this.processJSON(body.toString('utf-8'), url);

      default:
        return { text: '', links: [], metadata: {} };
    }
  }

  private async processHTML(html: string, baseUrl: string): Promise<ProcessedContent> {
    const linkExtractor = new LinkExtractor(html, baseUrl);
    const contentExtractor = new ContentExtractor(html);
    const metadataExtractor = new MetadataExtractor(html);

    return {
      text: contentExtractor.extractMainContent().text,
      links: linkExtractor.extractLinks().map(l => l.url),
      metadata: metadataExtractor.extract()
    };
  }

  private async processPDF(buffer: Buffer, baseUrl: string): Promise<ProcessedContent> {
    // Use pdf-parse or a similar library
    const pdfData = await pdfParse(buffer);
    const urlRegex = /https?:\/\/[^\s<>"{}|\\^\[\]`]+/g;
    const links = (pdfData.text.match(urlRegex) || []);

    return {
      text: pdfData.text,
      links,
      metadata: { pageCount: pdfData.numpages }
    };
  }

  private processXML(xml: string, baseUrl: string): ProcessedContent {
    // Handle RSS/Atom feeds
    const $ = cheerio.load(xml, { xmlMode: true });
    const links: string[] = [];

    $('item link, entry link').each((_, el) => {
      const href = $(el).text() || $(el).attr('href');
      if (href) links.push(href);
    });

    return { text: '', links, metadata: {} };
  }

  private processJSON(json: string, baseUrl: string): ProcessedContent {
    // Collect absolute URLs from any string values in the document
    try {
      const urls: string[] = [];
      const walk = (value: unknown): void => {
        if (typeof value === 'string' && /^https?:\/\//.test(value)) {
          urls.push(value);
        } else if (value && typeof value === 'object') {
          Object.values(value).forEach(walk);
        }
      };
      walk(JSON.parse(json));
      return { text: '', links: urls, metadata: {} };
    } catch {
      return { text: '', links: [], metadata: {} };
    }
  }
}

interface ProcessedContent {
  text: string;
  links: string[];
  metadata: Record<string, any>;
}
```

In production, content extraction is a multi-stage pipeline that transforms raw responses into structured, searchable data.
Pipeline stages:

1. Detect the content type and route to the appropriate processor
2. Decode the character set to UTF-8
3. Parse the document
4. Extract links for the frontier
5. Extract main content and metadata
6. Normalize and store the structured result
Character encoding errors corrupt extracted text. Check the Content-Type header, HTML meta charset, and BOM markers. When in doubt, use libraries like chardet for detection. Always normalize to UTF-8 for storage.
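A sketch of that decision order, assuming the `chardet` and `iconv-lite` npm packages:

```typescript
import * as chardet from 'chardet';
import * as iconv from 'iconv-lite';

// Decode a response body to a UTF-8 string, preferring explicit
// declarations over statistical detection.
function decodeBody(body: Buffer, contentTypeHeader: string | null): string {
  // 1. A UTF-8 BOM is authoritative.
  if (body.length >= 3 && body[0] === 0xef && body[1] === 0xbb && body[2] === 0xbf) {
    return body.subarray(3).toString('utf-8');
  }

  // 2. charset parameter in the Content-Type header.
  const headerMatch = contentTypeHeader?.match(/charset=["']?([\w-]+)/i);
  if (headerMatch && iconv.encodingExists(headerMatch[1])) {
    return iconv.decode(body, headerMatch[1]);
  }

  // 3. <meta charset> within the first few kilobytes of the document.
  const head = body.subarray(0, 4096).toString('latin1');
  const metaMatch = head.match(/<meta[^>]+charset=["']?([\w-]+)/i);
  if (metaMatch && iconv.encodingExists(metaMatch[1])) {
    return iconv.decode(body, metaMatch[1]);
  }

  // 4. Fall back to statistical detection, defaulting to UTF-8.
  const detected = chardet.detect(body);
  return iconv.decode(body, detected && iconv.encodingExists(detected) ? detected : 'utf-8');
}
```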
We've completed our comprehensive exploration of web crawler design. You now have the knowledge to architect a production-grade crawler from requirements through implementation.
You've mastered the architecture of large-scale web crawlers. These systems power search engines, price comparison sites, and countless other applications. The principles—distributed coordination, politeness, deduplication, and content extraction—apply broadly to any system that must systematically process vast amounts of external data.