Ethical Hacking Lesson 8 – Information Gathering Tools | Dataplexa

Information Gathering Tools

The recon phase is only as good as the tools behind it. This lesson walks through the specific tools professionals reach for at each stage of information gathering — what each one does, when to use it, and what the output actually tells you.

Tools don't replace thinking — they accelerate it

A common mistake beginners make is treating recon tools like a magic button. Run the tool, get the answer, move on. That approach misses the point entirely. Tools surface raw data — it is the pen tester's job to interpret that data, spot the anomalies, connect the dots, and decide what is worth pursuing further.

A port scanner that finds 47 open services on a server is not useful on its own. A pen tester who looks at those 47 services, notices that one of them is running software three major versions behind, cross-references it against a CVE database, and realises it is exploitable without authentication — that is useful.
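The triage step described above can be sketched in a few lines. This is a minimal illustration, not a real scanner integration: the `Service` fields, the `min_lag` threshold, and every entry in `scan` are hypothetical values invented for the example, and it assumes the latest major version of each product has already been looked up by hand.

```python
from dataclasses import dataclass


@dataclass
class Service:
    port: int
    product: str
    running_major: int  # major version seen in the scan banner
    latest_major: int   # latest major version, looked up separately (assumed known)


def stale_services(services, min_lag=2):
    """Return services at least `min_lag` major versions behind, worst first."""
    lagging = [s for s in services if s.latest_major - s.running_major >= min_lag]
    return sorted(lagging, key=lambda s: s.latest_major - s.running_major, reverse=True)


# Hypothetical scan results, invented for illustration only.
scan = [
    Service(22, "ssh-daemon", 9, 9),
    Service(443, "web-server", 2, 2),
    Service(8080, "app-server", 6, 9),
    Service(3306, "database", 7, 8),
]

for svc in stale_services(scan):
    lag = svc.latest_major - svc.running_major
    print(f"port {svc.port}: {svc.product} is {lag} major versions behind")
```

The filter only narrows the list. Deciding whether a flagged service is actually exploitable still requires checking it against a CVE database by hand.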

With that framing in place, here are the tools you will use most frequently during information gathering — all taught here strictly for authorised engagements.

The core toolkit — seven tools every pen tester knows

These are not obscure specialist tools. They appear in almost every professional engagement at some point. Some are passive, some are active — the table below maps each one so you know exactly what category it falls into before you reach for it.

INFORMATION GATHERING TOOLS — QUICK REFERENCE
| Tool | Primary use | Type | Needs authorisation? |
| --- | --- | --- | --- |
| theHarvester | Collect emails, subdomains, IPs and employee names from public sources | Passive | No |
| Shodan | Search engine for internet-facing devices; find open ports and services without scanning | Passive | No |
| Maltego | Visual intelligence mapping; connects people, domains, IPs and organisations into a relationship graph | Passive | No |
| Nmap | Port scanning, service detection and OS fingerprinting against live targets | Active | Yes |
| dig / nslookup | DNS record lookups; find A, MX, TXT, NS and CNAME records for a domain | Passive | No |
| Recon-ng | Modular recon framework; automates multiple intelligence gathering tasks in a structured workflow | Passive | No |
| WHOIS | Domain registration records: registrar, owner organisation, creation date, nameservers | Passive | No |

Six of the seven tools above are passive — they gather information without touching the target. Only Nmap is active. That ratio reflects how a real engagement is run. The majority of useful intelligence is gathered before a single active packet leaves your machine.
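One way to keep the passive/active distinction front of mind is to encode the quick-reference table as data and check it before running anything. The sketch below is an assumption about how a team might do that; the `may_run` helper and its `authorised` flag are hypothetical and not part of any of the tools listed.

```python
from enum import Enum


class ReconType(Enum):
    PASSIVE = "passive"
    ACTIVE = "active"


# Mirrors the Type column of the quick-reference table.
TOOLS = {
    "theHarvester": ReconType.PASSIVE,
    "Shodan": ReconType.PASSIVE,
    "Maltego": ReconType.PASSIVE,
    "Nmap": ReconType.ACTIVE,
    "dig / nslookup": ReconType.PASSIVE,
    "Recon-ng": ReconType.PASSIVE,
    "WHOIS": ReconType.PASSIVE,
}


def may_run(tool, authorised):
    """Active tools require authorisation; passive ones do not (per the table)."""
    if TOOLS[tool] is ReconType.ACTIVE:
        return authorised
    return True


passive = [name for name, kind in TOOLS.items() if kind is ReconType.PASSIVE]
print(f"{len(passive)} of {len(TOOLS)} tools are passive")
print("Nmap without authorisation:", may_run("Nmap", authorised=False))
```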

theHarvester — gathering emails and names from public sources

theHarvester is one of the first tools a pen tester runs. It queries search engines, LinkedIn, DNS records, and other public sources to pull together a list of email addresses, employee names, subdomains, and IP addresses associated with a target domain. All passive — it never contacts the target's systems directly.

The email addresses it finds are particularly valuable. Cross-reference them against a breach database and you may find that several employees are still using passwords that were leaked in a previous breach somewhere else — a credential stuffing opportunity that goes straight into the report.
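The cross-referencing step is essentially a set intersection. The sketch below assumes the harvested addresses and a legitimately obtained breach list are already available as plain Python lists; the function names and all sample addresses other than admin@ and hr@ are hypothetical.

```python
def normalise(email):
    """Lower-case and trim so comparisons ignore cosmetic differences."""
    return email.strip().lower()


def exposed_addresses(harvested, breached):
    """Return harvested addresses that also appear in the breach list, sorted."""
    breached_set = {normalise(e) for e in breached}
    return sorted({normalise(e) for e in harvested} & breached_set)


# Hypothetical data, invented for illustration only.
harvested = [
    "Admin@targetlawfirm.co.uk",
    "hr@targetlawfirm.co.uk",
    "j.smith@targetlawfirm.co.uk",
]
breach_list = [
    "hr@targetlawfirm.co.uk ",
    "someone@example.com",
    "admin@targetlawfirm.co.uk",
]

for address in exposed_addresses(harvested, breach_list):
    print("appears in breach data:", address)
```

Normalising case and whitespace first matters because breach dumps and harvested results rarely format addresses the same way.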

The scenario: You are on day one of a black box engagement against a UK-based law firm. Your team lead asks you to run theHarvester to build an initial picture of the target's email footprint and any subdomains visible from public sources. Nothing in this step touches the firm's servers.

# theHarvester gathers public intelligence about a target domain
# It searches multiple sources simultaneously — no packets hit the target

# -d specifies the target domain to investigate
# -b specifies which data sources to search
#    "all" tells it to query every available source at once
#    You can also specify individual sources like google, bing, linkedin
#    (available source names vary by version; check theHarvester -h)
# -l 200 limits the results to 200 entries so the output stays manageable

theHarvester -d targetlawfirm.co.uk -b all -l 200

Breaking it down:

-d targetlawfirm.co.uk
The target domain. theHarvester builds all its searches around this — it will look for email addresses at this domain, subdomains under it, and IP addresses associated with it.
-b all
Tells theHarvester to search every available source — Google, Bing, LinkedIn, DNS databases, certificate logs. Using "all" is slower but more thorough. On time-limited engagements, you might specify only the most productive sources like google,linkedin,crtsh.
admin@targetlawfirm.co.uk and hr@targetlawfirm.co.uk
Generic role-based addresses like admin@ and hr@ in the results are worth checking against breach databases immediately. They are often shared mailboxes where several people know the password, which makes the credentials both more likely to have been leaked and harder to change without disrupting operations.
staging.targetlawfirm.co.uk
A staging subdomain turns up in the results, as it does in almost every engagement. A law firm's staging environment is even more sensitive than a retailer's, because it likely contains draft legal documents and client data being tested on a less-secured system.
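The two observations above (role-based mailboxes and revealing subdomain names) can be turned into a quick first-pass filter over harvested results. This is a rough sketch under stated assumptions: the `ROLE_LOCAL_PARTS` and `INTERESTING_LABELS` watch lists are illustrative guesses, and the sample lists are hypothetical apart from the names that appear in this lesson.

```python
ROLE_LOCAL_PARTS = {"admin", "hr", "info", "support", "sales"}  # assumed watch list
INTERESTING_LABELS = {"staging", "dev", "test", "vpn"}          # assumed watch list


def is_role_address(email):
    """True when the part before @ is a generic role name rather than a person."""
    local_part = email.split("@", 1)[0].lower()
    return local_part in ROLE_LOCAL_PARTS


def interesting_subdomains(hosts, domain):
    """Return hosts under `domain` whose first label is on the watch list."""
    suffix = "." + domain.lower()
    flagged = []
    for host in hosts:
        host = host.lower().rstrip(".")
        if host.endswith(suffix) and host.split(".", 1)[0] in INTERESTING_LABELS:
            flagged.append(host)
    return flagged


# Sample results; entries not named in the lesson are hypothetical.
emails = [
    "admin@targetlawfirm.co.uk",
    "hr@targetlawfirm.co.uk",
    "j.smith@targetlawfirm.co.uk",
]
hosts = [
    "www.targetlawfirm.co.uk",
    "staging.targetlawfirm.co.uk",
    "vpn.targetlawfirm.co.uk",
]

print("role addresses:", [e for e in emails if is_role_address(e)])
print("subdomains to review:", interesting_subdomains(hosts, "targetlawfirm.co.uk"))
```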

Shodan — the search engine for internet-connected devices

Shodan is unlike any other search engine you have used. Where Google indexes websites, Shodan indexes devices: servers, routers, webcams, industrial control systems, and IoT devices that are reachable from the public internet. It continuously scans public IP address space and stores what it finds: open ports, running services, software versions, and banner information.

For a pen tester, Shodan means you can often discover what software a target's server is running before making any direct contact with it. That is significant — it turns active scanning information into passive intelligence.

The scenario: theHarvester returned an IP address — 89.44.12.201 — associated with the law firm's VPN subdomain. Before running any active scans against it, you check Shodan to see if it has already indexed this IP and knows what is running on it.

# Shodan CLI tool — queries Shodan's database for a specific IP address
# This is completely passive — no packets reach 89.44.12.201
# Shodan already scanned this IP previously and stored the results
# We are just reading their index, not contacting the server ourselves

# shodan host performs a lookup on a specific IP address
# (the CLI needs an API key, set once with: shodan init <API_KEY>)
# Replace the IP with the one you found in theHarvester's output
shodan host 89.44.12.201

Breaking it down:

Fortinet SSL-VPN 6.4.2 on port 8443
Shodan identified the exact software and version running on this port from the banner it returned during a previous scan. That version number is everything — it is what you cross-reference against vulnerability databases.
CVE-2022-40684 — CVSS 9.6
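A CVE (Common Vulnerabilities and Exposures) entry is a publicly documented security flaw with a unique ID. A CVSS score of 9.6 out of 10 is critical severity. This specific CVE is an authentication bypass in the administrative interface of Fortinet products, which lets an attacker perform administrative operations without any valid credentials. Shodan infers CVE tags from banner versions, so confirm against the vendor advisory that the detected version is actually affected before reporting it. Found passively, before a single active scan.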
OpenSSH 7.4 on port 22
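OpenSSH 7.4 is significantly outdated; it was released in 2016. While not as immediately critical as the Fortinet finding, it suggests a pattern of neglected updates on this server that may extend to other services. Some Linux distributions backport security fixes without changing the version string, so treat the banner as a lead to verify rather than proof.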

This is a critical finding. A CVSS 9.6 authentication bypass on the target's VPN gateway — found passively in under a minute, before the active phase even started. This goes straight into the report as a priority one finding and gets escalated to the client immediately rather than waiting until the end of the engagement.
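The "critical" label follows from the CVSS v3.x qualitative severity scale, which maps base scores to ratings in fixed bands. The sketch below encodes those bands; the `escalate_immediately` rule is an assumed team policy for illustration, not something defined by CVSS itself.

```python
def cvss_severity(score):
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


def escalate_immediately(score):
    """Assumed policy for this sketch: Critical findings go to the client at once."""
    return cvss_severity(score) == "Critical"


# The finding discussed in this lesson.
cve_id, score = "CVE-2022-40684", 9.6
print(cve_id, cvss_severity(score), "escalate now:", escalate_immediately(score))
```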

dig — reading DNS records directly

dig stands for Domain Information Groper. It is a command-line tool for querying DNS, the public system that maps domain names to IP addresses and other technical details. Every domain has a set of DNS records, and reading them carefully often reveals infrastructure details the company did not intend to expose.

DNS records come in several types. The most useful ones for recon are MX records (which mail server handles email for this domain), TXT records (which often contain SPF and DMARC security configurations), and NS records (which reveal the DNS provider). Each tells a different part of the story.

The scenario: You want to understand the law firm's email infrastructure before considering a phishing simulation. A dig query against their MX records will tell you exactly which mail server handles their incoming email — and whether it has any security configurations in place that might affect delivery of a test phishing email.

# dig queries the public DNS system for records about a domain
# This is treated as passive: it asks public DNS resolvers, not the target's servers
# (a resolver may still forward the query to the domain's authoritative nameservers)

# MX records tell us which mail server handles email for this domain
# The priority number before the server name controls which server is tried first
dig MX targetlawfirm.co.uk

# TXT records often contain SPF rules — which servers are allowed to send email
# as this domain. Missing or weak SPF makes phishing simulation easier.
dig TXT targetlawfirm.co.uk

# +short gives a cleaner, shorter output — useful when you just need the values
dig MX targetlawfirm.co.uk +short

Breaking it down:

MX 10 and MX 20
The number before the mail server is its priority. Lower number = higher priority. Email gets sent to the MX 10 server first. If that fails, it falls back to MX 20. Two mail servers means redundancy — and two targets to investigate.
v=spf1 include:mailgun.org ~all
This is the SPF record. It declares which servers are authorised to send email claiming to be from this domain. The ~all at the end means "soft fail": emails from unauthorised servers are marked as suspicious but not rejected outright on SPF grounds. A stricter -all would tell receivers to treat them as a hard fail. With this configuration, phishing emails that fail the SPF check will usually still be delivered, just potentially flagged; the final decision rests with the receiving mail server and any DMARC policy on the domain.
+short
Without +short, dig returns the full DNS response including timing data, TTL values, and query metadata. With +short, you get just the answer — much cleaner when you need to quickly read a result or pipe it into another command.
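Both interpretations above are mechanical enough to script. The sketch below assumes MX answers arrive as "priority host" lines, as `dig MX ... +short` prints them, and reads the qualifier on the SPF `all` mechanism. The mail server hostnames are hypothetical, since the lesson does not name them.

```python
SPF_QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def order_mx(lines):
    """Parse 'priority host' lines; return hosts in delivery order (lowest first)."""
    records = []
    for line in lines:
        priority, host = line.split()
        records.append((int(priority), host.rstrip(".")))
    return [host for _, host in sorted(records)]


def spf_all_policy(record):
    """Return the result assigned by the 'all' mechanism, or None if absent."""
    terms = record.strip().strip('"').split()
    if not terms or terms[0].lower() != "v=spf1":
        return None
    for term in terms[1:]:
        lowered = term.lower()
        if lowered in ("all", "+all", "-all", "~all", "?all"):
            # A bare "all" has no qualifier, which defaults to "+" (pass).
            qualifier = lowered[0] if lowered[0] in SPF_QUALIFIERS else "+"
            return SPF_QUALIFIERS[qualifier]
    return None


# Hypothetical hostnames; the priorities match the MX 10 / MX 20 example.
mx_lines = ["20 mx2.targetlawfirm.co.uk.", "10 mx1.targetlawfirm.co.uk."]
print("delivery order:", order_mx(mx_lines))
print("SPF all policy:", spf_all_policy("v=spf1 include:mailgun.org ~all"))
```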

Putting the tools together — a complete recon workflow

In practice, these tools are not used in isolation. A professional passive recon phase runs them in sequence — each one adding a layer to the picture. Here is how that workflow maps out from first command to completed intelligence profile.

PASSIVE RECON WORKFLOW — step by step
1. WHOIS: domain registration and ownership

Start here. Get the domain age, registrar, nameservers, and organisation name. Sets the baseline for everything that follows.

2. theHarvester: emails, subdomains, and people

Run against the primary domain. Builds the email footprint and surfaces subdomains you may not have known existed.

3. dig: DNS records and email security posture

Pull MX, TXT, NS and A records. Check SPF and DMARC configuration. Note any missing or weak records.

4. Shodan: known services and vulnerabilities on discovered IPs

Take every IP address found so far and check it in Shodan. Look for exposed services, software versions, and any CVEs already flagged against them.

5. Compile the profile: prioritise and document

Bring everything together into a structured document. Prioritise findings by risk. Identify the highest-value targets for the active phase. Log every source and timestamp.

That workflow can be completed in two to three hours for a standard target. The output — a structured intelligence profile with prioritised findings — is what the active scanning phase is built on. A weak passive phase means a weak active phase.
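The final step of the workflow (prioritise by risk, log every source and timestamp) maps naturally onto a small record type. The sketch below is one possible shape for such a profile, not a prescribed format: the `Finding` fields, the `RISK_ORDER` labels, and the risk level assigned to each sample finding are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

RISK_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}  # assumed labels


@dataclass
class Finding:
    summary: str
    risk: str    # one of the RISK_ORDER keys
    source: str  # tool that produced the finding
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def prioritise(findings):
    """Sort findings so the highest-risk items come first."""
    return sorted(findings, key=lambda f: RISK_ORDER[f.risk])


# Findings drawn from this lesson; the risk levels are illustrative guesses.
profile = [
    Finding("SPF record ends in ~all (softfail)", "medium", "dig TXT"),
    Finding("CVE-2022-40684 flagged on VPN gateway", "critical", "Shodan"),
    Finding("staging subdomain visible in public sources", "high", "theHarvester"),
]

for f in prioritise(profile):
    print(f"[{f.risk.upper()}] {f.summary} "
          f"(source: {f.source}, logged {f.logged_at.isoformat()})")
```

Recording the timestamp in UTC at creation time keeps the log consistent when several testers contribute from different time zones.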

Teacher's Note: Every tool in this lesson is pre-installed on Kali Linux — you will not need to set anything up manually. When you reach the lab setup lesson, you will have a working environment where you can run all of these commands against practice targets in a completely legal, isolated setting.

Practice questions

Scenario:

During passive recon on a target company, a pen tester discovers an IP address associated with the target's mail server. Without sending any packets to that IP, they want to find out what software is running on it, what ports are open, and whether any known vulnerabilities have been publicly documented against that software version. Which tool lets them do all of this passively?


Scenario:

A pen tester is preparing for a phishing simulation as part of a social engineering test. Before crafting any emails, they need to know which mail server handles incoming email for the target domain and what priority that server has. They run a dig query specifying a particular DNS record type to get this information. Which record type do they query?


Scenario:

On day one of a black box engagement, a pen tester needs to quickly build a list of email addresses associated with the target organisation — without touching any of their systems. They want to search Google, LinkedIn, and public DNS databases simultaneously in a single command. Which tool is built specifically for this type of multi-source passive email and subdomain harvesting?


Quiz

Scenario:

A junior pen tester argues that running "shodan host 89.44.12.201" counts as active reconnaissance because it returns real live data about an actual server — including its open ports, software versions, and known vulnerabilities. A senior tester disagrees. Who is correct, and why?

Scenario:

During passive recon on day one, your Shodan query returns a CVSS 9.8 remote code execution vulnerability on the target company's public-facing VPN gateway. The active scanning phase has not started yet. Your engagement is scheduled to run for two more weeks. What is the correct professional response to this finding?

Scenario:

A dig TXT query against a target domain returns this SPF record: "v=spf1 include:sendgrid.net ~all". Your team is planning a phishing simulation as part of the engagement. Your colleague asks what the ~all at the end means and whether it affects the likelihood of your test phishing emails reaching employees. What is the correct explanation?
