How to Use theHarvester for OSINT: See What Attackers See Before They Do

theHarvester pulls emails, subdomains, and IPs from public sources in seconds. Run it on your own domain and you might not like what you find.

Before an attacker targets your organization, they do recon. They're not starting with zero-days or brute force — they're starting with public data. Your domain name, your SSL certificates, your email addresses, the subdomains you forgot to decommission. All of it is passively queryable without touching your servers.

theHarvester automates that first phase. It pulls emails, subdomains, IP addresses, and employee names from search engines, certificate transparency logs, DNS aggregators, and other public sources. Security teams use it in penetration testing engagements. It ships by default in Kali Linux. Run it against your own domain, and you get a snapshot of exactly what an attacker sees before they engage.

What It Actually Does

theHarvester queries public aggregators, not your target's infrastructure directly. That distinction matters. When you run it against your domain using passive sources, your server never sees a single packet from theHarvester. Everything comes from third-party data stores: Google's search index, Bing, LinkedIn's people search, certificate transparency logs via crt.sh, urlscan.io's database of previously submitted scans.

This is OSINT in the strictest sense. Publicly available, legally accessible data. The tool's value is aggregation speed — a human could find all of this manually; theHarvester does it in seconds.

What it finds:

Email addresses: direct harvest from search engine results and data aggregators
Subdomains: the bigger haul, especially from certificate transparency logs
IP addresses and hostnames: from DNS records and passive DNS datasets
Employee names: from LinkedIn enumeration (limited without an API key)

The email and subdomain findings are where the real risk surface shows up.

Installation

Kali Linux: theHarvester is already installed. Just run it.

Debian/Ubuntu via pip:

pip3 install theHarvester

From source (recommended if the pip version is behind):

git clone https://github.com/laramies/theHarvester
cd theHarvester
python3 -m pip install -r requirements/base.txt
python3 theHarvester.py -h

The git version stays current with new source integrations; the pip package sometimes lags. For anything beyond basic use, cloning the repo is the safer choice.

Basic Usage

theHarvester -d example.com -b google,bing,linkedin,crtsh -l 200

-d is your target domain
-b is the comma-separated list of sources to query
-l limits results per source (default 500 — fine to leave as-is for most domains)

The most reliable sources that work without API keys: crtsh, bing, hackertarget, dnsdumpster. Google is increasingly rate-limited and often returns thin results without a paid API key. Source names also come and go between releases, so check theHarvester -h for the list your version supports. For subdomain enumeration specifically, crtsh alone is often the most useful starting point: certificate transparency logs are comprehensive and need no API key, although crt.sh itself can be slow or time out under load.

For a broader passive sweep:

theHarvester -d example.com -b google,bing,crtsh,urlscan,hackertarget,dnsdumpster -l 300

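Save output to a report file:

theHarvester -d example.com -b crtsh,bing,hackertarget -l 500 -f myreport

On the release this guide was written against, this creates myreport.html and myreport.xml in the current directory, and the HTML output is clean and easy to share with a team. Output formats have changed between releases (recent ones document -f as writing XML and JSON), so check theHarvester -h for what your version produces.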
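If you plan to run the same sweep on a schedule, it can help to assemble the command in a script instead of retyping it. The following is a minimal Python sketch that uses only the -d, -b, -l, and -f flags shown above. The function names build_command and run_harvest are my own and are not part of theHarvester, and the sketch assumes the theHarvester executable is on your PATH.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_command(domain: str, sources: Sequence[str], limit: int = 500,
                  report: Optional[str] = None) -> List[str]:
    """Assemble an argv list from the -d, -b, -l and -f flags."""
    if not domain or not sources:
        raise ValueError("a domain and at least one source are required")
    argv = ["theHarvester", "-d", domain, "-b", ",".join(sources), "-l", str(limit)]
    if report:
        argv += ["-f", report]
    return argv


def run_harvest(argv: List[str]) -> str:
    """Run the assembled command and return its terminal output as text."""
    # Assumption: the executable is installed and reachable via PATH.
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(argv[0] + " is not on PATH")
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    cmd = build_command("example.com", ["crtsh", "bing", "hackertarget"], report="myreport")
    print(" ".join(cmd))
```

Passing the arguments as a list rather than a shell string keeps a domain name read from a file or a ticket from being interpreted by the shell.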

Reading the Output

The terminal output gives you four categories to pay attention to:

Emails found: Direct harvest from search results and public sources. Each email is a spear phishing vector, a password spray target, a credential stuffing entry. If you see two addresses that both follow a firstname.lastname@ format, you can now generate the address of any other employee with high confidence. Most organizations use consistent naming conventions, and theHarvester just revealed the pattern.

Subdomains: This is where things get interesting. Expected subdomains (www, mail, vpn) are fine. The concerning ones: dev.example.com, staging.example.com, old.example.com, jenkins.example.com, gitlab.example.com. These are the forgotten services: stood up for a project, never decommissioned, running software that hasn't been patched since 2022. Certificate transparency logs are close to exhaustive here because practically every publicly trusted SSL/TLS certificate issued for a subdomain leaves a permanent public record. Entries are not removed once logged, so a hostname stays discoverable long after its certificate has expired.

IPs and ASNs: Reveals your hosting provider, cloud regions, CDN configuration. Mostly informational, but useful for correlating infrastructure and understanding your overall footprint.

Employee names: LinkedIn enumeration gives partial org chart data. Limited without a paid API key, but even partial results show team structure that's useful for impersonation attacks.
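For a long subdomain list, a rough first pass is to sort hostnames by how worrying their labels look. The sketch below is one way to do that in Python, assuming you have already copied the hostnames out of the output into a list. The label sets come straight from the examples in this section, the function name triage_subdomains is hypothetical, and matching is exact per label, so a host like dev2.example.com would land in the unclassified bucket.

```python
import re
from typing import Dict, Iterable, List

# Labels taken from the examples in this guide; extend them for your environment.
RISKY_LABELS = {"dev", "staging", "old", "jenkins", "gitlab"}
EXPECTED_LABELS = {"www", "mail", "vpn"}


def triage_subdomains(hosts: Iterable[str], domain: str) -> Dict[str, List[str]]:
    """Bucket hostnames under `domain` by the labels they contain."""
    buckets: Dict[str, List[str]] = {"review_first": [], "expected": [], "unclassified": []}
    suffix = "." + domain.lower()
    unique = {h.strip().lower().rstrip(".") for h in hosts}
    for host in sorted(unique):
        if not host.endswith(suffix):
            continue  # skip the apex domain and anything outside the target
        prefix = host[: -len(suffix)]
        # Split on dots and hyphens so staging-api and dev.api both match.
        tokens = re.split(r"[.-]", prefix)
        if any(token in RISKY_LABELS for token in tokens):
            buckets["review_first"].append(host)
        elif prefix in EXPECTED_LABELS:
            buckets["expected"].append(host)
        else:
            buckets["unclassified"].append(host)
    return buckets
```

The unclassified bucket is not a safe list. It is simply everything the keyword match could not place, and it still needs a human look.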

What to Do With What You Find

Exposed emails: run them through Have I Been Pwned. Any address that shows up in breach data is a credential stuffing risk. If you're the security contact, you can use HIBP's domain search feature to check your entire domain at once.

Unexpected subdomains: investigate each one. What is it running? When was it last updated? If it's not intentionally public-facing, either firewall it or take it down. This is shadow IT made visible — tools and services staff spun up independently that the security team never knew existed.

The email naming convention problem: if theHarvester reveals your format is firstname.lastname@, you've now disclosed your address generation logic. Consider whether your public-facing security contacts (abuse@, security@, info@) are worth keeping separate from the personal-name convention — it limits the blast radius.
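To see how much of your naming convention is on display, you can tally the harvested addresses by shape. This is a minimal Python sketch under a few assumptions: the role-account set holds only the three examples named above, the pattern names and the function summarize_addresses are my own, and the regexes only recognize formats with a visible separator. A format like jsmith cannot be told apart from a first name without a list of real names, so it falls into "other".

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Optional, Union

# Role accounts named in this guide; add your own shared mailboxes.
ROLE_ACCOUNTS = {"abuse", "security", "info"}

# Only separator-based formats are detectable from shape alone.
PATTERNS = (
    ("firstname.lastname", re.compile(r"^[a-z]{2,}\.[a-z]{2,}$")),
    ("initial.lastname", re.compile(r"^[a-z]\.[a-z]{2,}$")),
    ("firstname_lastname", re.compile(r"^[a-z]{2,}_[a-z]{2,}$")),
)

Summary = Dict[str, Union[List[str], Dict[str, int], Optional[str]]]


def classify_local_part(local: str) -> str:
    """Return the name of the first pattern that matches, else 'other'."""
    for name, regex in PATTERNS:
        if regex.match(local):
            return name
    return "other"


def summarize_addresses(emails: Iterable[str]) -> Summary:
    """Separate role accounts from personal addresses and count the formats."""
    roles: List[str] = []
    counts: Counter = Counter()
    unique = {e.strip().lower() for e in emails if "@" in e}
    for email in sorted(unique):
        local = email.split("@", 1)[0]
        if local in ROLE_ACCOUNTS:
            roles.append(email)
        else:
            counts[classify_local_part(local)] += 1
    dominant = counts.most_common(1)[0][0] if counts else None
    return {"role_accounts": roles, "pattern_counts": dict(counts), "dominant_pattern": dominant}
```

If one format accounts for nearly every personal address, assume an attacker has drawn the same conclusion.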

This connects to the broader question of what your organization is inadvertently publishing, which I covered in the digital footprint guide. theHarvester is the automated version of that audit, applied at the domain level rather than the individual level.

Certificate Transparency

Certificate transparency logs deserve a separate mention because they're both the most powerful and least understood source in theHarvester's toolkit.

When a publicly trusted certificate authority issues an SSL/TLS certificate, it submits the certificate to public CT logs, because major browsers reject certificates that lack proof of logging. The requirement exists for security reasons: it lets anyone detect misissued certificates. The side effect: every subdomain named in a logged certificate is permanently public. crt.sh aggregates these logs and makes them searchable.

This is why crtsh is consistently the most productive source in theHarvester. You can also check crt.sh directly in a browser: search for %.example.com to see every subdomain that has ever appeared in a certificate for that domain. The results can be alarming for organizations that have been around for a while.
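The same %.example.com search can be scripted. Here is a minimal Python sketch that builds the query URL and pulls hostnames out of the response. It assumes crt.sh's JSON output (the output=json parameter) and its name_value field, which holds newline-separated names, as they behaved when I last checked. Neither is a guaranteed stable API, and the function names are my own.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Iterable, List, Mapping


def crtsh_url(domain: str) -> str:
    """Build the crt.sh query URL for every name under `domain`."""
    query = urllib.parse.urlencode({"q": "%." + domain, "output": "json"})
    return "https://crt.sh/?" + query


def extract_names(records: Iterable[Mapping[str, Any]], domain: str) -> List[str]:
    """Collect unique hostnames under `domain` from crt.sh-style records."""
    suffix = "." + domain.lower()
    names = set()
    for record in records:
        # Assumption: name_value holds one or more newline-separated names.
        for name in str(record.get("name_value", "")).splitlines():
            name = name.strip().lower()
            if name.startswith("*."):
                name = name[2:]  # a wildcard only reveals its parent name
            if "@" not in name and name.endswith(suffix):
                names.add(name)
    return sorted(names)


def fetch_records(domain: str, timeout: float = 30.0) -> List[Mapping[str, Any]]:
    """Query crt.sh over HTTPS; the service can be slow, so allow a long timeout."""
    with urllib.request.urlopen(crtsh_url(domain), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only query domains you own or are authorized to test.
    print(crtsh_url("example.com"))
```

Wildcard certificates are worth remembering here: a certificate for *.dev.example.com tells you dev.example.com exists but hides whatever sits beneath it.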

I mentioned CT logs briefly in the typosquatting and phishing domain guide in a different context — they're also how you can catch attackers registering lookalike certs for your domain.

One Hard Rule

Only run theHarvester against domains you own or have explicit written authorization to test.

"It's public data" doesn't insulate you from legal risk in most jurisdictions. The Computer Fraud and Abuse Act in the US, the Computer Misuse Act in the UK, and analogous laws in most countries have provisions that can cover unauthorized reconnaissance — even passive OSINT — depending on how prosecutors want to read them. Bug bounty programs typically permit passive recon against in-scope domains; check the program's specific rules before running anything.

This is also covered in the Google dorking guide's ethical use section, and the principle is the same. Public data, legitimate tool, but the target matters.

Useful for More Than Just Offense

The most common use I see for theHarvester isn't actually in pentesting engagements. It's routine security hygiene: run it against your own domain every few months and see what's accumulated. New subdomains you didn't know about. Old emails from former employees still indexed. A development server someone stood up and forgot.
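The "what's accumulated" part is easier to act on if you keep the previous run and compare. Below is a minimal Python sketch of that comparison. It assumes you store each run's hostnames as a plain JSON list in a file of your choosing. The file layout and the function names are my own convention, not something theHarvester produces.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def normalize(hosts: Iterable[str]) -> set:
    """Lowercase and trim hostnames, dropping empty entries."""
    return {h.strip().lower() for h in hosts if h.strip()}


def diff_snapshots(previous: Iterable[str], current: Iterable[str]) -> Dict[str, List[str]]:
    """Report hostnames that appeared, disappeared, or stayed between two runs."""
    prev, curr = normalize(previous), normalize(current)
    return {
        "added": sorted(curr - prev),
        "removed": sorted(prev - curr),
        "unchanged": sorted(curr & prev),
    }


def load_snapshot(path: str) -> List[str]:
    """Read a saved hostname list, or return an empty list on the first run."""
    file = Path(path)
    return json.loads(file.read_text()) if file.exists() else []


def save_snapshot(path: str, hosts: Iterable[str]) -> None:
    """Write the normalized hostname list as sorted JSON for the next comparison."""
    Path(path).write_text(json.dumps(sorted(normalize(hosts)), indent=2))
```

The "added" list is where the forgotten development server shows up. The "removed" list mostly tells you a source dropped a record, not that the host is gone.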

The findings often aren't catastrophic. But they're always informative, and sometimes they reveal something that would have meant a bad day had an attacker found it first.