Foolproof HTML escaping in Javascript
For those who don't want to read a lengthy blog post:
// Use the browser's built-in functionality to quickly and safely escape
// the string
function escapeHtml(str) {
  var div = document.createElement('div');
  div.appendChild(document.createTextNode(str));
  return div.innerHTML;
}

// UNSAFE with unsafe strings; only use on previously-escaped ones!
function unescapeHtml(escapedStr) {
  var div = document.createElement('div');
  div.innerHTML = escapedStr;
  var child = div.childNodes[0];
  return child ? child.nodeValue : '';
}
Users can be malicious
As any good web developer knows, it's important to be constantly vigilant with the handling of user data. We avoid buffer overflows and format string exploits (remember those?) by using safer languages or being careful in our C. To avoid SQL injection, we never build database queries by concatenating user-supplied data. These measures protect the integrity of the data on our servers, but what about our (non-malicious) users?
These days, most new websites accept user input which will later be displayed to other users. This brings in a host of new issues, including XSS and CSRF attacks. The problem in both cases is that a user-supplied string might contain arbitrary HTML, including Javascript or carefully-crafted <img> tags, and if we mindlessly dump it to other users' browsers, such scripts could hijack user identities or send sensitive data to an attacker. Browsers are starting to offer some defense against these attacks, but historically the onus has been on web developers to recognize where we are sending user input to the browser and to properly escape that data.
Escaping user input
Many developers have opinions on the proper place to escape user input. I'll review three options:
- Before storage
- On the backend, while building the HTML
- On the frontend, in the Javascript that builds HTML
Before storage
One approach is to escape data as soon as you receive it. This is widely used because it's one of the most foolproof methods: just filter the input once and forget about it. The core problem with this approach is that in its usual implementation, it's essentially a one-way operation; the data that the user originally submitted is no longer retrievable.
You've probably noticed clumsy implementations on various discussion sites that allow editing: your edit window shows your text all mangled. All your < have been converted to &lt;. Furthermore, if you don't change it back, the next edit will show &amp;lt;. This highlights another weakness of this method: it is susceptible to double-escaping, since there is often no bookkeeping to indicate that the data has been escaped already.
I prefer to maintain the data in its original form and handle it with care elsewhere.
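The double-escaping problem described above can be sketched in a few lines. This is only an illustration: escapeForStorage is a hypothetical helper standing in for whatever filter a site applies on input, and the comment text is made up.

```javascript
// Hypothetical helper for illustration: a minimal escape-before-storage step.
function escapeForStorage(input) {
  return input
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// The user submits a comment; it is escaped once before being stored.
var submitted = 'if (a < b) return;';
var stored = escapeForStorage(submitted);

// The edit form shows the stored (escaped) text. The user saves without
// changing it back, and the same filter runs again on the way in.
var storedAfterEdit = escapeForStorage(stored);

console.log(stored);          // if (a &lt; b) return;
console.log(storedAfterEdit); // if (a &amp;lt; b) return;
```

Nothing in the stored string records how many times it has been escaped, which is why the second pass cannot know to skip it.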
On the backend
For years, the languages commonly used for web development have included libraries that properly handle HTML escaping. Good developers clearly indicate in the code and documentation where user-created data exists, and they use appropriate libraries to escape all such data as it is converted into an HTML page. Barring problems in the library implementation or lapses in vigilance, this is a solid approach, and it allows a lot more flexibility than the previously-discussed method. For example, it is now possible to echo the input as-is back to the creator for editing purposes.
This is a valid approach to the problem, but as the popularity of AJAX, JSON, and Javascript widget-based rendering continues to increase, the backend approach is often not an option. We have to perform the escaping on the frontend.
On the frontend
A lot of new web development today centers on dynamically-created content. We build a basic HTML frame with some Javascript that pulls in everything else and places it on the page. A user action results in an AJAX request being sent to the server with a JSON response sent back that contains data to be placed on the page. As with HTML escaping, most web languages have excellent libraries for converting arbitrary objects to JSON. All characters that are sensitive in a Javascript context are properly escaped in the conversion to JSON. But what happens when it's received on the frontend?
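To make the question concrete, here is a small sketch of the round trip. The payload and the comment field name are invented for illustration, and JSON.stringify stands in for whatever JSON library the backend uses.

```javascript
// Hypothetical payload: a comment submitted by a malicious user.
var comment = '<img src=x onerror="alert(1)">';

// What the backend would send: valid JSON, with the double quotes escaped
// for a JSON/Javascript context.
var json = JSON.stringify({ comment: comment });

// What the frontend holds after parsing the response.
var unsafe_str = JSON.parse(json).comment;

console.log(json);
console.log(unsafe_str === comment); // true: the markup survives untouched
```

JSON encoding protects the Javascript context only. Once parsed, the string is byte-for-byte what the user typed, HTML and all.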
Escaping in Javascript
Quite often, we're receiving user-created data as part of a JSON response, and eventually we have a string of that data assigned to a Javascript variable. Now we want to build the HTML using that string. Let's call it unsafe_str.
The unsafe way
document.getElementById("whereItGoes").innerHTML += unsafe_str;
This approach is vulnerable to every problem I outlined at the outset. Don't do this in your code! It's not much more difficult to do it in a safe way.
The safe way
document.getElementById("whereItGoes").appendChild(document.createTextNode(unsafe_str));
This uses the browser's own knowledge of which characters are sensitive to properly escape the string. It's fast, and according to quirksmode, it's supported by every browser out there. But when we're concatenating strings or building some widget via a class hierarchy, this isn't always possible. Sometimes we need to escape the string way before we add it to a DOM node. Enter various hacks to make that happen.
Hack #1: inline
var safe_str = unsafe_str.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
I see this all the time, and I've been guilty of similar methods on both the backend and the frontend. You know it's inefficient, but it's trivial and it works and it's basically a one-liner. Then you notice a random bug where you converted < to &gt; by mistake, or maybe you forgot the pesky semicolon, so you decide to create a canonical escape function. Then it turns out that you sometimes need to escape part of an HTML tag attribute. You eventually settle on something like the following.
Hack #2: the catchall
function escapeHTML (unsafe_str) {
  return unsafe_str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/\"/g, '&quot;')
    .replace(/\'/g, '&#39;'); // '&apos;' is not valid HTML 4
}
This works pretty well. It handles the most important cases, but you know in the back of your mind that it's wasteful. You've traversed the string five times (creating five new strings!) just to return the escaped version, and you have &quot; entities all over the page where they aren't necessary. Your programming aesthetic takes over and one evening you convert it.
Hack #3: more efficient catchall
var ESC_MAP = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

function escapeHTML(s, forAttribute) {
  return s.replace(forAttribute ? /[&<>'"]/g : /[&<>]/g, function(c) {
    return ESC_MAP[c];
  });
}
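As a rough usage sketch of why the forAttribute flag matters, the snippet below restates the function with its entity strings written out so that it runs on its own. The input string and the input tag around it are invented for illustration.

```javascript
// Restated from Hack #3 so this sketch is self-contained.
var ESC_MAP = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;'
};

function escapeHTML(s, forAttribute) {
  return s.replace(forAttribute ? /[&<>'"]/g : /[&<>]/g, function (c) {
    return ESC_MAP[c];
  });
}

// Hypothetical input that tries to break out of a double-quoted attribute.
var unsafe_str = '" onmouseover="alert(1)';

// Text-content mode leaves the quotes alone, so the attribute closes early.
var broken = '<input value="' + escapeHTML(unsafe_str) + '">';

// Attribute mode escapes the quotes, so the value stays inside the attribute.
var safe = '<input value="' + escapeHTML(unsafe_str, true) + '">';

console.log(broken); // <input value="" onmouseover="alert(1)">
console.log(safe);   // <input value="&quot; onmouseover=&quot;alert(1)">
```

The first result gives the attacker a working event handler; the second keeps the whole string as the value of the attribute.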
Now you're pretty happy. You only traverse the string once. You handle escaping both within and outside of attributes. But eventually you want to un-escape the strings you've escaped. In the process of writing that function, you learn that there are 252 named entities in HTML 4, in addition to numeric entities like &#dddd; (decimal) and &#xhhhh; (hex). Wow, you think, there must be a better way. And then you think back to "the safe way":
document.getElementById("whereItGoes").appendChild(document.createTextNode(unsafe_str));
Can't we leverage that? We can!
The best way to escape HTML in Javascript
// Use the browser's built-in functionality to quickly and safely escape
// the string
function escapeHtml(str) {
  var div = document.createElement('div');
  div.appendChild(document.createTextNode(str));
  return div.innerHTML;
}

// UNSAFE with unsafe strings; only use on previously-escaped ones!
function unescapeHtml(escapedStr) {
  var div = document.createElement('div');
  div.innerHTML = escapedStr;
  var child = div.childNodes[0];
  return child ? child.nodeValue : '';
}
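The browser already knows how to escape strings: document.createTextNode stores the string as plain text, and reading innerHTML back serializes it with the sensitive characters escaped. We can take advantage of this to make string escaping fast, safe, and dead-simple. It's hard to beat the browser's builtin (often C++) code with your Javascript. So don't reinvent the wheel; let the browser solve the problem for you! One caveat: this serialization escapes &, <, and > but leaves quotes alone, so a string headed for an attribute value still needs the quote handling from Hack #3.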
Acknowledgments
Many thanks to Big Dingus and ceefour who provided the inspiration for most of the code on this page.