Port scanning reveals services. Vulnerability scanning identifies weaknesses. But comprehensive reconnaissance extends far beyond technical probing—it encompasses all methods of gathering intelligence about a target.
Information gathering combines OSINT (Open Source Intelligence, collected from publicly available sources) with social engineering and passive observation. A skilled attacker knows that the most valuable information often comes not from port scans but from job postings mentioning technology stacks, employee LinkedIn profiles revealing internal project names, GitHub repositories with accidentally committed credentials, or forgotten subdomains hosting development systems.
This page explores the full spectrum of information gathering: passive vs. active techniques, data sources, social engineering components, and how defenders can reduce their information exposure.
By mastering this page, you will: (1) Understand passive vs. active information gathering, (2) Leverage OSINT sources for technical and organizational intelligence, (3) Recognize information leakage in common business operations, (4) Apply social engineering concepts ethically, and (5) Implement controls to reduce organizational information exposure.
Information gathering techniques fall into two fundamental categories: passive (no direct target interaction) and active (direct communication with target). Understanding this distinction is crucial—it affects detectability, legality, and the type of information obtained.
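Passive gathering:
The reconnaissance subject (target) is completely unaware. Information comes from third-party and public sources such as search engines, Shodan, and web archives rather than from the target's own systems.
Active gathering:
Direct interaction, such as port scans, direct contact, or social engineering, creates evidence of reconnaissance on the target's side.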
| Aspect | Passive | Active |
|---|---|---|
| Detection Risk | Nearly zero | Moderate to high |
| Legal Concerns | Generally safe | May require authorization |
| Information Depth | Surface level | Deeper, verified |
| Data Freshness | May be outdated | Current |
| Examples | Shodan, Google, Archive.org | Nmap, direct contact, social engineering |
| Time Required | Extensive (many sources) | Faster (direct answers) |
The reconnaissance spectrum:
[Fully Passive] ◄─────────────────────────────────► [Fully Active]
Shodan lookups → Google dorking → Website browsing → Email contact → Social engineering → Port scan
Practical implications:
Some activities are ambiguous. Viewing the target's website seems passive, but your IP address appears in their logs, which technically leaves a trace. Using third-party services that probe on your behalf (e.g., subdomain enumeration services) is often considered passive since you don't interact with the target directly.
OSINT encompasses all intelligence gathered from publicly available information. The sheer volume of useful data available without any hacking is remarkable—and often underestimated.
Categories of OSINT sources:
# WHOIS lookup
whois example.com

# Certificate Transparency search (%25 is the URL-encoded % wildcard)
curl -s "https://crt.sh/?q=%25.example.com&output=json" | jq .

# Shodan search (requires API key)
shodan search "hostname:example.com"

# Historical DNS (requires a SecurityTrails API key)
curl -H "APIKEY: $SECURITYTRAILS_API_KEY" "https://api.securitytrails.com/v1/history/example.com/dns/a"

# Wayback Machine API
curl "https://web.archive.org/cdx/search/cdx?url=example.com&output=json"

Google dorking (or Google hacking) uses advanced search operators to find information that organizations inadvertently exposed to search engines.
Core operators:
| Operator | Purpose | Example |
|---|---|---|
| site: | Limit to specific domain | site:example.com |
| inurl: | Terms must appear in URL | inurl:admin |
| intitle: | Terms in page title | intitle:"index of" |
| filetype: | Specific file extensions | filetype:pdf |
| intext: | Terms in page body | intext:password |
| cache: | Google's cached version (retired by Google in 2024) | cache:example.com |
| "exact phrase" | Exact string match | "database error" |
| -term | Exclude results | site:example.com -www |
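The operators in the table combine mechanically: one `site:` scope followed by alternatives joined with `OR`. A minimal sketch of that composition follows; `build_dork` is a hypothetical helper name, not part of any tool mentioned here, and it only assembles the query string without sending it anywhere.

```sh
# build_dork DOMAIN [TERM...]
# Scope a query to one domain and join the remaining terms with OR.
# The ( ... ) body runs in a subshell, so the variables stay local.
build_dork() (
  domain=$1
  shift
  query="site:${domain}"
  sep=" "
  for term in "$@"; do
    query="${query}${sep}${term}"
    sep=" OR "
  done
  printf '%s\n' "$query"
)

build_dork example.com filetype:conf filetype:cfg
# prints: site:example.com filetype:conf OR filetype:cfg
```

Terms that contain spaces, such as `intitle:"index of"`, should be passed as a single quoted argument so the inner quotes survive.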
# Find subdomains indexed by Google
site:example.com -www

# Find exposed configuration files
site:example.com filetype:conf OR filetype:config OR filetype:cfg

# Find directory listings
site:example.com intitle:"Index of"

# Find exposed login pages
site:example.com inurl:login OR inurl:admin OR inurl:signin

# Find error messages revealing info
site:example.com "database error" OR "mysql error" OR "syntax error"

# Find exposed documents
site:example.com filetype:pdf OR filetype:doc OR filetype:xls

# Find exposed backup files
site:example.com filetype:bak OR filetype:sql OR filetype:old

# Find potentially sensitive pages
site:example.com inurl:password OR inurl:secret OR inurl:credentials

# Find exposed cameras/IoT
inurl:"/view/view.shtml" OR inurl:"viewerframe?mode="

# Find exposed S3 buckets
site:s3.amazonaws.com "example"

The Google Hacking Database (GHDB):
Exploit-DB maintains a database of effective dorks categorized by purpose:
URL: https://www.exploit-db.com/google-hacking-database
Automation:
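Tools such as Googler (CLI) and DorkSearch help run dork queries at scale. Shodan is a separate search engine for internet-connected devices with its own filter syntax, so it complements dorking rather than automating it.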
Google limits automated searches and may flag your IP for unusual query patterns. Use delays, rotate queries, and consider using API access where available. Excessive dorking can resemble attack traffic to the sites you're researching.
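The advice about delays can be made concrete with a randomized pause between queries. The sketch below is a dry run under stated assumptions: `next_delay` is a hypothetical helper, the 20-second base and 10-second spread are arbitrary illustration values, and the loop only prints what it would do instead of sleeping or searching.

```sh
# next_delay BASE SPREAD
# Print a whole number of seconds between BASE and BASE+SPREAD inclusive.
# awk's srand() seeds from the clock, so calls within the same second repeat.
next_delay() {
  awk -v base="$1" -v spread="$2" \
    'BEGIN { srand(); print base + int(rand() * (spread + 1)) }'
}

# Dry run: show the pause that would precede each query.
while IFS= read -r query; do
  printf 'wait %ss, then search: %s\n' "$(next_delay 20 10)" "$query"
done <<'EOF'
site:example.com filetype:pdf
site:example.com intitle:"index of"
EOF
```

In a real run the `printf` would be replaced by `sleep "$(next_delay 20 10)"` followed by whatever authorized search client is in use.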
Deep investigation of domain and infrastructure relationships reveals organizational scope, hosting decisions, and potential attack vectors beyond the obvious main website.
Domain intelligence sources:
# Reverse WHOIS (find all domains by registrant)
# ViewDNS.info, DomainTools (commercial)

# ASN lookup
whois -h whois.radb.net -- '-i origin AS12345'
# Lists route objects (IP prefixes) registered with that origin AS

# Find subdomains via multiple sources
amass enum -d example.com -passive

# Passive subdomain search across many sources, including certificate transparency
subfinder -d example.com -all -silent

# Reverse IP lookup (203.0.113.10 is a placeholder; use a public address in scope)
curl "https://api.hackertarget.com/reverseiplookup/?q=203.0.113.10"

# Cloud bucket discovery
# Common patterns: {company}, {company}-dev, {company}-backup
aws s3 ls s3://example-company/ --no-sign-request

# Shodan host lookup
shodan host 203.0.113.10

Attack surface expansion:
Starting with one domain, good OSINT often reveals:
example.com
├── www.example.com (main website)
├── mail.example.com (email infrastructure)
├── vpn.example.com (🎯 VPN endpoint)
├── dev.example.com (🎯 development server)
├── staging.example.com (🎯 possibly weaker security)
├── api.example.com (🎯 API surface)
├── jenkins.example.com (🎯 CI/CD system)
└── related discoveries:
├── exampleinc.com (alternate domain)
├── acq-target.com (recent acquisition)
└── example-internal.slack.com (cloud services)
Each discovery becomes a potential attack vector. Development and staging environments often have weaker controls than production.
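Once subdomain lists grow, a quick filter helps surface the hosts marked as interesting in the tree above. This is a rough sketch: `flag_interesting` is a hypothetical helper, and the keyword list simply mirrors the flagged examples (`vpn`, `dev`, `staging`, `api`, `jenkins`); a real assessment would tune it.

```sh
# flag_interesting
# Read hostnames on stdin and print those whose first label starts with a
# keyword that suggests a non-production or administrative system.
flag_interesting() {
  grep -E -i '^(vpn|dev|staging|api|jenkins)[.-]'
}

printf '%s\n' \
  www.example.com \
  dev.example.com \
  mail.example.com \
  jenkins.example.com |
  flag_interesting
# prints:
# dev.example.com
# jenkins.example.com
```

The trailing `[.-]` keeps `dev.example.com` and `dev-api.example.com` while skipping unrelated names such as `developer.example.com`.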
Acquired companies are often security weak points. They may have been integrated operationally but not security-wise—retaining old infrastructure, different security standards, or forgotten systems. Always research parent/subsidiary relationships.
The most sophisticated technical defenses can be bypassed by exploiting human psychology. Social engineering intelligence gathers information that enables manipulation of people rather than systems.
What social engineers look for:
| Category | Examples | Attack Use |
|---|---|---|
| Organizational Structure | Who reports to whom, department names | Impersonation, authority exploitation |
| Employee Details | Names, roles, email formats, phone numbers | Spear phishing, pretexting |
| Operational Patterns | Work hours, office locations, remote work | Timing attacks, in-person access |
| Technology Context | Help desk processes, vendor names | Tech support impersonation |
| Personal Interests | Hobbies, social groups, family info | Building rapport, pretext creation |
| Current Events | Recent projects, reorganizations, outages | Timely pretexts |
Building employee lists:
Comprehensive employee enumeration enables spear phishing, pretexting, and impersonation, the attack uses listed in the table above.
Methods:
# LinkedIn - Company page followers/employees
# Tool: linkedin2username, CrossLinked
# GitHub - Contributors to company repositories
git log --format='%aN <%aE>' | sort -u
# Email harvesting from web
theHarvester -d example.com -b all
# Hunter.io - Email format discovery (requires API key)
curl "https://api.hunter.io/v2/domain-search?domain=example.com&api_key=$HUNTER_API_KEY"
# Google dork for email addresses
site:example.com "@example.com"
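The `git log` author listing above usually mixes corporate addresses with personal and no-reply ones. The following sketch narrows it to one domain; `company_emails` is a hypothetical helper, and it assumes input lines in the `Name <email>` shape produced by `--format='%aN <%aE>'`.

```sh
# company_emails DOMAIN
# Read "Name <email>" lines on stdin and print each address at DOMAIN once,
# lower-cased, in first-seen order.
company_emails() {
  awk -v domain="$1" '
    match($0, /<[^>]+>/) {
      addr = tolower(substr($0, RSTART + 1, RLENGTH - 2))
      suffix = "@" tolower(domain)
      if (length(addr) > length(suffix) &&
          substr(addr, length(addr) - length(suffix) + 1) == suffix &&
          !(addr in seen)) {
        seen[addr] = 1
        print addr
      }
    }'
}

printf '%s\n' \
  'Jane Doe <Jane.Doe@Example.com>' \
  'Build Bot <41898282+bot@users.noreply.github.com>' \
  'Jane Doe <jane.doe@example.com>' |
  company_emails example.com
# prints: jane.doe@example.com
```

Comparing the surviving addresses side by side is often enough to spot the organization's naming convention, which is also what a defender would check when auditing their own repositories.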
Gathering information about individuals raises significant ethical and legal concerns. In legitimate assessments, scope explicitly defines what's permitted. Never stalk, harass, or gather information on individuals without clear authorization. Privacy laws (GDPR, CCPA, etc.) may restrict certain data collection.
Pretexting development:
With gathered intelligence, attackers craft believable pretexts:
| Gathered Intel | Resulting Pretext |
|---|---|
| Uses Office 365 + IT director name | "Hi, John from IT, there's an Office 365 issue affecting accounts" |
| Recent acquisition of Company B | "I'm from Company B, trying to set up my new access" |
| Uses ServiceNow for tickets | "This is ServiceNow support, we detected suspicious activity" |
| CEO travels frequently | "The CEO needs this wire transfer approved while traveling" |
Files published by organizations often contain metadata—hidden information about their creation and the systems that created them. This metadata can reveal usernames, software versions, internal paths, and more.
Common metadata sources:
# ExifTool - comprehensive metadata extraction
exiftool document.pdf
# Output may include:
# Creator: John.Smith
# Operating System: Windows 10
# PDF Producer: Microsoft Word 2019
# Create Date: 2024:01:15 14:30:22

# Image metadata (may reveal location)
exiftool photo.jpg
# GPS Position: 37.7749° N, 122.4194° W
# Camera Make: Apple
# Camera Model: iPhone 15 Pro

# FOCA - automated document metadata analysis (Windows)
# Analyzes enterprise document collections

# Web page metadata
curl -s https://example.com | grep -E "generator|author|framework"

# Metagoofil - automated document download and analysis
metagoofil -d example.com -t pdf,doc,xls -l 100 -o output/

What metadata reveals:
| Metadata Type | Example | Intelligence Value |
|---|---|---|
| Author/Creator | jsmith, john.smith@corp.local | Username format, domain name |
| Software Version | Microsoft Word 2016 | Indicates patch levels, potential vulns |
| Internal Paths | C:\Users\jsmith\Documents\Confidential\ | Directory structure, user naming |
| Network Paths | \\fileserver\share\HR\ | Internal server names, SMB shares |
| GPS Coordinates | 37.7749, -122.4194 | Physical locations, executive travel |
| Printer Names | HP-LaserJet-Floor3-Legal | Office layout, department locations |
Download all PDFs, Word documents, and Excel files from the target's website, then batch-analyze their metadata. This often reveals consistent username patterns, internal server names, software versions, and sometimes credentials embedded in file properties.
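Part of that batch analysis can be approximated with standard text tools. The sketch below pulls usernames out of Windows profile paths found in a metadata dump; `usernames_from_paths` is a hypothetical helper, and the sample lines stand in for whatever a metadata extractor printed. It relies on `grep -o`, which is widely available but not strictly POSIX.

```sh
# usernames_from_paths
# Read text on stdin and print the unique user names that appear in
# C:\Users\<name>\... paths, sorted.
usernames_from_paths() {
  grep -o -i 'C:\\Users\\[^\\]*' | sed 's/.*\\//' | sort -u
}

usernames_from_paths <<'EOF'
Template: C:\Users\jsmith\Documents\Confidential\plan.dotx
Creator: jsmith
Source Path: C:\Users\asmith\Desktop\budget.xlsx
EOF
# prints:
# asmith
# jsmith
```

Running the same filter over documents your own organization publishes shows which account names are already exposed.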
Public code repositories (GitHub, GitLab, Bitbucket) are treasure troves for reconnaissance. Developers accidentally commit secrets, configuration files reveal infrastructure details, and code history shows evolution of systems.
What to search for:
| Target | Search Terms | Value |
|---|---|---|
| API keys/secrets | api_key, apikey, secret, token | Direct access to services |
| Passwords | password=, passwd, pwd | Potential valid credentials |
| AWS credentials | AKIA, aws_secret_access_key | Cloud account access |
| Database credentials | mysql://, postgres://, mongodb:// | Database access |
| Internal URLs | internal, staging, dev, corp | Internal infrastructure |
| SSH keys | BEGIN RSA PRIVATE KEY | Server access |
| Configuration files | .env, config.yml, settings.json | Infrastructure details |
# GitHub search operators
org:example-company password
org:example-company filename:.env
org:example-company extension:pem
org:example-company "api_key ="
org:example-company "AKIA" # AWS access key prefix

# Automated secret scanning tools
# TruffleHog - scans git history for secrets
trufflehog git https://github.com/example/repo.git

# GitLeaks - SAST tool for secrets
gitleaks detect --source=/path/to/repo

# Gitrob - organizational scanning (deprecated but pattern useful)
# Modern alternatives: GitHound, shhgit

# Search commit history
git log -p | grep -E "(password|secret|api_key|token)"

# Find deleted (but still in history) secrets
git log --diff-filter=D --summary | grep -E "\.(env|pem|key)"

Git history awareness:
Even if secrets are removed from current code, git history preserves them forever unless explicitly purged:
# Show file at any historical commit
git show <commit-hash>:path/to/file.env
# Find when a file was deleted
git log --all --full-history -- "**/secret.txt"
# Check if sensitive file ever existed
git log --all --oneline -- "**/.env*"
Common findings:
Finding a secret in git history means it must be considered compromised. Removing it from the current version doesn't help if attackers already cloned the repo. All discovered secrets must be rotated immediately.
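For auditing your own repositories, several of the search terms from the table can be folded into one filter over `git log -p` output. This is a deliberately small sketch, far less thorough than TruffleHog or GitLeaks: `scan_secrets` is a hypothetical helper, the patterns are illustrative approximations, and the sample input uses a documentation-style key rather than a real credential.

```sh
# scan_secrets
# Read text on stdin and print "line-number:line" for each line matching
# one of a few common secret patterns.
scan_secrets() {
  grep -n -E \
    -e 'AKIA[0-9A-Z]{16}' \
    -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' \
    -e '(mysql|postgres|mongodb)://[^[:space:]]+:[^[:space:]]+@' \
    -e '(password|passwd|pwd|api_key|apikey|secret|token)[[:space:]]*[=:]'
}

scan_secrets <<'EOF'
region = us-east-1
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
DATABASE_URL=postgres://app:hunter2@db.internal/app
EOF
# prints:
# 2:aws_access_key_id = AKIAIOSFODNN7EXAMPLE
# 3:DATABASE_URL=postgres://app:hunter2@db.internal/app
```

A typical use would be `git log -p --all | scan_secrets`; expect false positives, since the keyword pattern matches any assignment to a name like `token`.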
From a defensive perspective, minimizing information exposure reduces reconnaissance effectiveness. You can't prevent all OSINT collection, but you can reduce what attackers learn.
Strategic information reduction:
Proactive monitoring:
Monitor for your own exposure:
# Google alerts for company name + sensitive terms
# "Example Corp" password OR leaked OR breach
# Monitor Shodan for your public IP ranges (198.51.100.0/24 is a placeholder)
shodan alert create "My Company" 198.51.100.0/24
# GitHub code search for your domain
# "@example.com" password
# Have I Been Pwned domain search
# Monitor for corporate emails in breach databases
# Certificate transparency monitoring
# Get alerted when new certs issued for your domain
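Certificate transparency monitoring boils down to comparing the hostnames seen today with the ones you already know about. A minimal sketch of that comparison follows; `new_hosts` is a hypothetical helper, the known-hosts file is assumed to hold one lower-case hostname per line, and the observed names would come from a certificate transparency query in practice.

```sh
# new_hosts KNOWN_FILE
# Read observed hostnames on stdin, normalize them (lower-case, strip a
# leading "*."), and print those that are not listed in KNOWN_FILE.
new_hosts() {
  tr '[:upper:]' '[:lower:]' |
    sed 's/^\*\.//' |
    sort -u |
    grep -v -x -F -f "$1"
}

known=$(mktemp)
printf '%s\n' www.example.com mail.example.com > "$known"
printf '%s\n' www.example.com '*.dev.example.com' Mail.Example.com jenkins.example.com |
  new_hosts "$known"
rm -f "$known"
# prints:
# dev.example.com
# jenkins.example.com
```

Note that `grep` exits with status 1 when nothing new is found, which matters if the function is called from a script running under `set -e`.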
Regular OSINT assessments:
Conduct periodic OSINT against your own organization.
The only way to understand your OSINT exposure is to conduct reconnaissance against yourself. Regular self-assessment reveals what attackers will find—and gives you the opportunity to reduce exposure before it's exploited.
Information gathering extends reconnaissance far beyond technical scanning, encompassing all intelligence sources that build an understanding of the target.
What's next:
With comprehensive reconnaissance coverage complete—port scanning, network mapping, vulnerability scanning, and information gathering—we'll conclude with Detection and Prevention. This final page covers how defenders detect reconnaissance activities and implement controls to prevent or limit intelligence gathering.
You now understand the full scope of information gathering—from Google dorking to social engineering intelligence. You can conduct comprehensive OSINT and implement defensive measures to reduce your organization's exposure. Next, we'll examine detection and prevention strategies for the complete reconnaissance lifecycle.