Fetching a web page is only half the battle. The raw HTML returned by a server is a complex mix of content, navigation, advertisements, scripts, and boilerplate. Extracting the meaningful content and discovering links for further crawling is where real value is created.
Content extraction transforms raw bytes into structured, indexable information. Done well, it enables search engines to understand page topics, news aggregators to identify articles, and price comparison sites to extract product details. Done poorly, it produces garbage data that corrupts downstream systems.
By the end of this page, you will understand: (1) HTML parsing strategies and libraries, (2) Link extraction and URL resolution, (3) Main content extraction techniques, (4) Handling JavaScript-rendered content, (5) Metadata extraction (titles, descriptions, structured data), and (6) Dealing with diverse content types.
Real-world HTML is messy. Pages contain syntax errors, unclosed tags, and malformed markup. A robust parser must handle the worst while still extracting value.
Parser types:
| Library | Language | Type | Malformed HTML |
|---|---|---|---|
| cheerio | JavaScript/Node | DOM (jQuery-like) | Good |
| jsdom | JavaScript/Node | Full DOM | Excellent |
| BeautifulSoup | Python | DOM | Excellent |
| lxml | Python | DOM/SAX | Good |
| Jsoup | Java | DOM | Excellent |
| htmlparser2 | JavaScript | SAX/streaming | Good |
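A quick illustration of lenient parsing before the full wrapper below: cheerio (like the other DOM libraries above) repairs unclosed tags into a queryable tree. The markup and expected output in this snippet are our own minimal example:

```typescript
import * as cheerio from 'cheerio';

// Malformed markup: unclosed <p> and <b> tags plus a stray </i>.
const messy = '<p>First paragraph <b>bold text<p>Second paragraph</i>';
const $ = cheerio.load(messy);

// The parser closes dangling tags per the HTML spec, so queries still work.
console.log($('p').length);         // 2
console.log($('p').first().text()); // "First paragraph bold text"
```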
A reusable parsing wrapper built on cheerio:

```typescript
import * as cheerio from 'cheerio';

class HTMLParser {
  private $: cheerio.CheerioAPI;

  constructor(html: string) {
    this.$ = cheerio.load(html, {
      decodeEntities: true,
      lowerCaseTags: true,
      lowerCaseAttributeNames: true,
    });
  }

  // Extract page title, falling back to Open Graph
  getTitle(): string {
    return this.$('title').first().text().trim() ||
      this.$('meta[property="og:title"]').attr('content') || '';
  }

  // Extract meta description, falling back to Open Graph
  getDescription(): string {
    return this.$('meta[name="description"]').attr('content') ||
      this.$('meta[property="og:description"]').attr('content') || '';
  }

  // Extract all links with anchor text and rel attributes
  getLinks(): Array<{ href: string; text: string; rel: string }> {
    const links: Array<{ href: string; text: string; rel: string }> = [];
    this.$('a[href]').each((_, el) => {
      const $el = this.$(el);
      links.push({
        href: $el.attr('href') || '',
        text: $el.text().trim(),
        rel: $el.attr('rel') || ''
      });
    });
    return links;
  }

  // Extract canonical URL, if declared
  getCanonical(): string | null {
    return this.$('link[rel="canonical"]').attr('href') || null;
  }
}
```

Discovering new URLs is a core crawler function. Links appear in many forms and must be normalized before being added to the frontier.
Link sources in HTML:
- `<a href="...">` — Standard hyperlinks
- `<link rel="canonical" href="...">` — Canonical URL
- `<link rel="alternate" href="...">` — Alternate versions
- `<img src="...">`, `<script src="...">` — Resources (usually filtered)
- `<form action="...">` — Form targets (often excluded)
- `data-*` attributes — Framework-specific patterns (see the sketch below)
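Framework-specific `data-*` patterns vary widely by site. As a minimal sketch, a crawler might scan a few conventional attribute names; the names `data-href`, `data-url`, and `data-src` below are common conventions we assume, not a standard:

```typescript
import * as cheerio from 'cheerio';

// Hypothetical sketch: harvest URLs from conventional data-* attributes.
function extractDataAttributeUrls(html: string, baseUrl: string): string[] {
  const $ = cheerio.load(html);
  const urls: string[] = [];
  $('[data-href], [data-url], [data-src]').each((_, el) => {
    const raw = $(el).attr('data-href') ?? $(el).attr('data-url') ?? $(el).attr('data-src');
    if (!raw) return;
    try {
      const resolved = new URL(raw, baseUrl).href; // resolve relative values
      if (resolved.startsWith('http')) urls.push(resolved);
    } catch {
      // Ignore values that are not URLs (JSON blobs, ids, etc.)
    }
  });
  return urls;
}
```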
For standard hyperlinks, the extractor below resolves relative URLs (honoring any `<base>` tag), filters non-HTTP schemes, records link context, and deduplicates:

```typescript
class LinkExtractor {
  private baseUrl: string;
  private $: cheerio.CheerioAPI;

  constructor(html: string, baseUrl: string) {
    this.$ = cheerio.load(html);
    // Check for a <base> tag, which overrides the base URL
    const baseTag = this.$('base[href]').attr('href');
    this.baseUrl = baseTag ? new URL(baseTag, baseUrl).href : baseUrl;
  }

  extractLinks(): Array<{ url: string; context: LinkContext }> {
    const links: Array<{ url: string; context: LinkContext }> = [];

    this.$('a[href]').each((_, el) => {
      const $el = this.$(el);
      const href = $el.attr('href');
      if (!href) return;

      const resolved = this.resolveUrl(href);
      if (!resolved) return;

      // Skip non-crawlable protocols
      if (!resolved.startsWith('http://') && !resolved.startsWith('https://')) {
        return;
      }

      links.push({
        url: resolved,
        context: {
          anchorText: $el.text().trim().slice(0, 200),
          rel: $el.attr('rel') || '',
          isNavigation: this.isNavigationLink($el),
          isNofollow: ($el.attr('rel') || '').includes('nofollow'),
        }
      });
    });

    return this.deduplicateLinks(links);
  }

  private resolveUrl(href: string): string | null {
    try {
      // Handle protocol-relative URLs
      if (href.startsWith('//')) {
        href = 'https:' + href;
      }
      return new URL(href, this.baseUrl).href;
    } catch {
      return null; // Invalid URL
    }
  }

  private isNavigationLink($el: cheerio.Cheerio<any>): boolean {
    // Check whether the link sits inside nav, header, or footer chrome
    return $el.closest('nav, header, footer, .navigation, .menu').length > 0;
  }

  private deduplicateLinks(links: Array<{ url: string; context: LinkContext }>) {
    const seen = new Set<string>();
    return links.filter(link => {
      if (seen.has(link.url)) return false;
      seen.add(link.url);
      return true;
    });
  }
}

interface LinkContext {
  anchorText: string;
  rel: string;
  isNavigation: boolean;
  isNofollow: boolean;
}
```

Links marked with `rel="nofollow"` indicate that the site owner doesn't endorse the target. You may still crawl these URLs, but they should receive lower priority in ranking calculations. Similarly, `sponsored` and `ugc` (user-generated content) links carry different trust signals.
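A minimal sketch of turning `rel` tokens into a coarse trust label (the label names here are illustrative, not a standard):

```typescript
type LinkTrust = 'endorsed' | 'nofollow' | 'sponsored' | 'ugc';

// rel is a space-separated token list, so split before matching.
function classifyRel(rel: string): LinkTrust {
  const tokens = rel.toLowerCase().split(/\s+/);
  if (tokens.includes('sponsored')) return 'sponsored'; // paid placement
  if (tokens.includes('ugc')) return 'ugc';             // user-generated content
  if (tokens.includes('nofollow')) return 'nofollow';   // not endorsed
  return 'endorsed';                                    // default: editorial link
}

classifyRel('ugc nofollow'); // 'ugc' -- the more specific signal wins
```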
Separating main content from boilerplate (headers, footers, sidebars, ads) is essential for both indexing and duplicate detection. Several approaches exist:
Extraction techniques include stripping known boilerplate elements, scoring candidate containers by text density, rewarding paragraph structure, and penalizing link-dense regions. The extractor below combines all four:
```typescript
class ContentExtractor {
  private $: cheerio.CheerioAPI;

  constructor(html: string) {
    this.$ = cheerio.load(html);
    this.removeBoilerplate();
  }

  private removeBoilerplate(): void {
    // Remove non-content elements
    this.$('script, style, noscript, iframe, svg').remove();
    this.$('header, footer, nav, aside').remove();
    this.$('[role="navigation"], [role="banner"], [role="contentinfo"]').remove();
    this.$('.advertisement, .ads, .sidebar, .comments').remove();
  }

  extractMainContent(): { text: string; wordCount: number } {
    // Find content candidates
    const candidates = this.findContentCandidates();

    // Score candidates by text density
    let bestCandidate = { element: this.$('body'), score: 0 };
    for (const candidate of candidates) {
      const score = this.scoreElement(candidate);
      if (score > bestCandidate.score) {
        bestCandidate = { element: candidate, score };
      }
    }

    const text = bestCandidate.element.text()
      .replace(/\s+/g, ' ')
      .trim();

    return {
      text,
      wordCount: text.split(/\s+/).filter(w => w.length > 0).length
    };
  }

  private findContentCandidates(): cheerio.Cheerio<any>[] {
    const candidates: cheerio.Cheerio<any>[] = [];

    // Common content containers
    const selectors = [
      'article', 'main', '[role="main"]',
      '.content', '.post', '.article', '.entry',
      '#content', '#main', '#article'
    ];

    for (const selector of selectors) {
      this.$(selector).each((_, el) => {
        candidates.push(this.$(el));
      });
    }

    // Also consider divs with substantial text
    this.$('div').each((_, el) => {
      const $el = this.$(el);
      if ($el.text().length > 500) {
        candidates.push($el);
      }
    });

    return candidates;
  }

  private scoreElement($el: cheerio.Cheerio<any>): number {
    const text = $el.text();
    const html = $el.html() || '';

    // Text density: ratio of text to HTML
    const textDensity = text.length / Math.max(html.length, 1);

    // Paragraph bonus: articles have paragraphs
    const paragraphCount = $el.find('p').length;

    // Link penalty: navigation-heavy areas have many links
    const linkDensity = $el.find('a').text().length / Math.max(text.length, 1);

    return (textDensity * 100) + (paragraphCount * 10) - (linkDensity * 50);
  }
}
```

The modern web increasingly relies on JavaScript to render content. Single Page Applications (SPAs) may return nearly empty HTML that only populates after JavaScript execution.
Options for JavaScript content:
| Approach | Pros | Cons | Use Case |
|---|---|---|---|
| Skip JS | Fast, simple, resource-efficient | Miss dynamic content | Static sites, blogs |
| Headless browser (Puppeteer/Playwright) | Full rendering, accurate | Slow, resource-intensive | SPAs, complex sites |
| Pre-rendering services | Outsource complexity | Cost, latency | Hybrid approach |
| API discovery | Direct structured data | Site-specific | Known data sources |
A Puppeteer-based renderer that blocks heavyweight resources:

```typescript
import puppeteer, { Browser } from 'puppeteer';

class JSRenderer {
  private browser: Browser | null = null;

  async initialize(): Promise<void> {
    this.browser = await puppeteer.launch({
      headless: 'new',
      args: ['--no-sandbox', '--disable-setuid-sandbox']
    });
  }

  async renderPage(url: string): Promise<{ html: string; finalUrl: string }> {
    if (!this.browser) throw new Error('Browser not initialized');

    const page = await this.browser.newPage();
    try {
      // Block unnecessary resources to speed up rendering
      await page.setRequestInterception(true);
      page.on('request', req => {
        const type = req.resourceType();
        if (['image', 'media', 'font'].includes(type)) {
          req.abort();
        } else {
          req.continue();
        }
      });

      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 30000
      });

      // Give dynamic content a moment to settle
      await new Promise(resolve => setTimeout(resolve, 2000));

      const html = await page.content();
      const finalUrl = page.url();
      return { html, finalUrl };
    } finally {
      await page.close();
    }
  }

  async close(): Promise<void> {
    await this.browser?.close();
  }
}
```

Don't render everything with a headless browser: it's too slow and expensive. Use heuristics instead. If the initial HTML has minimal content (under 1 KB of text) but the page comes from a known SPA framework, queue it for rendering; most pages don't need it.
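A minimal sketch of that gating heuristic, assuming a 1 KB text threshold and a few common SPA mount-point markers (both are illustrative choices, not fixed rules):

```typescript
import * as cheerio from 'cheerio';

// Decide whether a fetched page is worth re-queuing for headless rendering.
function needsRendering(html: string): boolean {
  const $ = cheerio.load(html);
  $('script, style, noscript').remove();
  const textLength = $('body').text().replace(/\s+/g, ' ').trim().length;

  // Plenty of static text: no rendering needed.
  if (textLength >= 1024) return false;

  // Thin HTML plus a recognizable SPA mount point or framework artifact.
  const spaMarkers = [
    '#root', '#app',          // common React/Vue mount points
    '[data-reactroot]',       // older React
    'script#__NEXT_DATA__',   // Next.js
    '[ng-app], [ng-version]', // Angular
  ];
  return spaMarkers.some(sel => $(sel).length > 0);
}
```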
Beyond main content, pages contain valuable metadata in standardized formats that enhance understanding and enable rich search results.
```typescript
interface PageMetadata {
  title: string;
  description: string;
  canonical: string | null;
  language: string | null;
  author: string | null;
  publishDate: string | null;
  openGraph: Record<string, string>;
  structuredData: object[];
}

class MetadataExtractor {
  private $: cheerio.CheerioAPI;

  constructor(html: string) {
    this.$ = cheerio.load(html);
  }

  extract(): PageMetadata {
    return {
      title: this.getTitle(),
      description: this.getDescription(),
      canonical: this.$('link[rel="canonical"]').attr('href') || null,
      language: this.$('html').attr('lang') || null,
      author: this.$('meta[name="author"]').attr('content') || null,
      publishDate: this.getPublishDate(),
      openGraph: this.getOpenGraph(),
      structuredData: this.getStructuredData(),
    };
  }

  private getTitle(): string {
    return this.$('title').text().trim() ||
      this.$('meta[property="og:title"]').attr('content') || '';
  }

  private getDescription(): string {
    return this.$('meta[name="description"]').attr('content') ||
      this.$('meta[property="og:description"]').attr('content') || '';
  }

  private getPublishDate(): string | null {
    const selectors = [
      'meta[property="article:published_time"]',
      'meta[name="date"]',
      'time[datetime]'
    ];
    for (const sel of selectors) {
      const val = this.$(sel).attr('content') || this.$(sel).attr('datetime');
      if (val) return val;
    }
    return null;
  }

  private getOpenGraph(): Record<string, string> {
    const og: Record<string, string> = {};
    this.$('meta[property^="og:"]').each((_, el) => {
      const prop = this.$(el).attr('property')?.replace('og:', '');
      const content = this.$(el).attr('content');
      if (prop && content) og[prop] = content;
    });
    return og;
  }

  private getStructuredData(): object[] {
    const data: object[] = [];
    this.$('script[type="application/ld+json"]').each((_, el) => {
      try {
        const json = JSON.parse(this.$(el).html() || '');
        data.push(json);
      } catch { /* ignore parse errors */ }
    });
    return data;
  }
}
```

Not everything on the web is HTML. Crawlers encounter PDFs, images, XML feeds, and more. Each content type requires specific handling.
| Content-Type | Extraction Approach | Link Discovery |
|---|---|---|
| text/html | Standard HTML parsing | Extract <a>, <link>, etc. |
| application/pdf | PDF text extraction (pdf-parse) | Extract embedded URLs |
| application/xml, text/xml | XML parsing; RSS/Atom feeds | Extract <link> elements |
| application/json | JSON parsing; API responses | Extract URL fields |
| image/* | OCR for text; EXIF metadata | Limited (embedded URLs rare) |
| application/javascript | Skip or parse for URLs | Regex URL extraction |
A dispatcher that routes each response by MIME type:

```typescript
import pdfParse from 'pdf-parse';

class ContentProcessor {
  async process(url: string, contentType: string, body: Buffer): Promise<ProcessedContent> {
    const mimeType = contentType.split(';')[0].trim().toLowerCase();

    switch (mimeType) {
      case 'text/html':
      case 'application/xhtml+xml':
        return this.processHTML(body.toString('utf-8'), url);

      case 'application/pdf':
        return this.processPDF(body, url);

      case 'application/xml':
      case 'text/xml':
      case 'application/rss+xml':
      case 'application/atom+xml':
        return this.processXML(body.toString('utf-8'), url);

      case 'application/json':
        return this.processJSON(body.toString('utf-8'), url);

      default:
        return { text: '', links: [], metadata: {} };
    }
  }

  private async processHTML(html: string, baseUrl: string): Promise<ProcessedContent> {
    const linkExtractor = new LinkExtractor(html, baseUrl);
    const contentExtractor = new ContentExtractor(html);
    const metadataExtractor = new MetadataExtractor(html);

    return {
      text: contentExtractor.extractMainContent().text,
      links: linkExtractor.extractLinks().map(l => l.url),
      metadata: metadataExtractor.extract()
    };
  }

  private async processPDF(buffer: Buffer, baseUrl: string): Promise<ProcessedContent> {
    // Use pdf-parse or a similar library
    const pdfData = await pdfParse(buffer);
    const urlRegex = /https?:\/\/[^\s<>"{}|\\^\[\]`]+/g;
    const links = (pdfData.text.match(urlRegex) || []);

    return {
      text: pdfData.text,
      links,
      metadata: { pageCount: pdfData.numpages }
    };
  }

  private processXML(xml: string, baseUrl: string): ProcessedContent {
    // Handle RSS/Atom feeds
    const $ = cheerio.load(xml, { xmlMode: true });
    const links: string[] = [];

    $('item link, entry link').each((_, el) => {
      const href = $(el).text() || $(el).attr('href');
      if (href) links.push(href);
    });

    return { text: '', links, metadata: {} };
  }

  private processJSON(json: string, baseUrl: string): ProcessedContent {
    // Collect absolute URLs from any string values in the document
    try {
      const urls: string[] = [];
      const walk = (value: unknown): void => {
        if (typeof value === 'string' && /^https?:\/\//.test(value)) {
          urls.push(value);
        } else if (value && typeof value === 'object') {
          Object.values(value).forEach(walk);
        }
      };
      walk(JSON.parse(json));
      return { text: '', links: urls, metadata: {} };
    } catch {
      return { text: '', links: [], metadata: {} };
    }
  }
}

interface ProcessedContent {
  text: string;
  links: string[];
  metadata: Record<string, any>;
}
```

In production, content extraction is a multi-stage pipeline that transforms raw responses into structured, searchable data.
Pipeline stages:

1. Detect the content type and route to the appropriate processor
2. Decode the character set to UTF-8
3. Parse the document
4. Extract links for the frontier
5. Extract main content and metadata
6. Normalize and store the structured result
Character encoding errors corrupt extracted text. Check the Content-Type header, HTML meta charset, and BOM markers. When in doubt, use libraries like chardet for detection. Always normalize to UTF-8 for storage.
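A sketch of that decision order, assuming the `chardet` and `iconv-lite` npm packages:

```typescript
import * as chardet from 'chardet';
import * as iconv from 'iconv-lite';

// Decode a response body to a UTF-8 string, preferring explicit
// declarations over statistical detection.
function decodeBody(body: Buffer, contentTypeHeader: string | null): string {
  // 1. A UTF-8 BOM is authoritative.
  if (body.length >= 3 && body[0] === 0xef && body[1] === 0xbb && body[2] === 0xbf) {
    return body.subarray(3).toString('utf-8');
  }

  // 2. charset parameter in the Content-Type header.
  const headerMatch = contentTypeHeader?.match(/charset=["']?([\w-]+)/i);
  if (headerMatch && iconv.encodingExists(headerMatch[1])) {
    return iconv.decode(body, headerMatch[1]);
  }

  // 3. <meta charset> within the first few kilobytes of the document.
  const head = body.subarray(0, 4096).toString('latin1');
  const metaMatch = head.match(/<meta[^>]+charset=["']?([\w-]+)/i);
  if (metaMatch && iconv.encodingExists(metaMatch[1])) {
    return iconv.decode(body, metaMatch[1]);
  }

  // 4. Fall back to statistical detection, defaulting to UTF-8.
  const detected = chardet.detect(body);
  return iconv.decode(body, detected && iconv.encodingExists(detected) ? detected : 'utf-8');
}
```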
We've completed our comprehensive exploration of web crawler design. You now have the knowledge to architect a production-grade crawler from requirements through implementation.
You've mastered the architecture of large-scale web crawlers. These systems power search engines, price comparison sites, and countless other applications. The principles—distributed coordination, politeness, deduplication, and content extraction—apply broadly to any system that must systematically process vast amounts of external data.